<img align="left" src="images/GMIT-logo.png" alt="GMIT" width="250"/>                                                      <img align="right" src="images/data-analytics.png" alt="HDipDA" width="300"/>  

# <center>Fundamentals of Data Analysis - Tips Project 2019</center> #

***
**Module Name**: Fundamentals of Data Analysis  
**Module Number**: 52446  
**Student Name**: Yvonne Brady  
**Student ID**: G00376355  
***

_**Note:** For ease of navigation, where markdown cells are not adjacent, a ">>>" link jumps to the next markdown cell. This helps anyone who wants to read the analysis without the code that produced it. Where a code cell displays data results or a plot, it is not skipped over in this manner._

### Description - 30% ###
Create a git repository and make it available online for the lecturer to clone. The repository should contain all your work for this assessment. Within the repository, create a jupyter notebook that uses descriptive statistics and plots to describe the tips dataset. This part is worth 30% of your overall mark.

### Regression - 30% ###
To the above jupyter notebook add a section that discusses and analyses whether there is a relationship between the total bill and tip amount, and this part is also worth 30%.

### Analysis - 40% ###
Again using the same notebook, analyse the relationship between the variables within the dataset. You are free to interpret this as you wish — for example, you may analyse all pairs of variables, or select a subset and analyse those. This part is worth 40%.

## Table of Contents
1. [Introduction](#intro)  
2. [General Description of Dataset](#info)  
3. [Categorical Data](#cat)  

## <a name="intro"></a>1.0 Introduction
This project involves the tips dataset that comes as part of the seaborn package. 

<span style='background :yellow' > 
**_Caveat:_** No collection boundaries are documented for this dataset, which limits how far any analysis of it can be trusted; see below for details. From what we can ascertain from Google searching, the data appears to come from one waiter who recorded information about each tip he received over a period of a few months working in one restaurant. It is on this basis that the analysis was done.  </span>

All caveats aside, in order to start our analysis, we must first import our packages. [>>>](#info)

In [1]:
# First of all, import all the packages you need.
# Importing a lot of packages may have adverse effects on the performance of your script,
# but this is not important for this dataset and investigation.
# Should enhanced performance be required, the list of imports may be re-thought.

import numpy as np # foundation of all data processing packages
import pandas as pd # using dataframes etc
import matplotlib.pyplot as plt # plotting and as a basis for seaborn
import seaborn as sns # fancier plotting and statistics etc
import pandas_profiling # found this - for profiling the dataset initially
from scipy import stats # For statistics
from tabulate import tabulate # To make some tables a bit easier
import sklearn.neighbors as nei
import sklearn.model_selection as mod
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from sklearn import metrics

# magic command to allow for easier integration of matplotlib plots in the Jupyter notebook
%matplotlib inline 

In [2]:
# And the dataset itself (also included in this repository)
tips = sns.load_dataset("tips")

In [3]:
# Now have a look at the data
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null category
time          244 non-null category
size          244 non-null int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.2 KB


In [4]:
print("The dataset has",tips.shape[0], "rows, each with", tips.shape[1], "attributes - totalling", tips.size, "data values in the dataset.")
print("Over the data collection period, the waiter served", tips['size'].sum(), "customers in",tips.shape[0], "transactions, generating an income of $", tips["total_bill"].sum(), ". ")
print("This resulted in tips totalling $", tips["tip"].sum().round(2), ".")
print("The day breakdown is as follows:")
print(tips.groupby("day").size())

The dataset has 244 rows, each with 7 attributes - totalling 1708 data values in the dataset.
Over the data collection period, the waiter served 627 customers in 244 transactions, generating an income of $ 4827.77 . 
This resulted in tips totalling $ 731.58 .
The day breakdown is as follows:
day
Thur    62
Fri     19
Sat     87
Sun     76
dtype: int64


In [5]:
# Now have a look at the data - first the initial few rows to see what they look like
tips.head(5) # returns first 5 rows

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [6]:
# Then the last few rows as a check to ensure the data has not gone awry in the middle somewhere
tips.tail() # returns last 5 rows

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.0,Female,Yes,Sat,Dinner,2
241,22.67,2.0,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2
243,18.78,3.0,Female,No,Thur,Dinner,2


In [7]:
# And finally a random sample to see what else there is in there
tips.sample(5) # returns 5 randomly selected rows

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
96,27.28,4.0,Male,Yes,Fri,Dinner,2
25,17.81,2.34,Male,No,Sat,Dinner,4
129,22.82,2.18,Male,No,Thur,Lunch,3
71,17.07,3.0,Female,No,Sat,Dinner,3
198,13.0,2.0,Female,Yes,Thur,Lunch,2


In [8]:
tips.isnull().any()

total_bill    False
tip           False
sex           False
smoker        False
day           False
time          False
size          False
dtype: bool

In [9]:
# See what values are in the categorical data
print("Gender Categories:",tips["sex"].unique())
print("Smoker Categories:",tips["smoker"].unique())
print("Day Categories:",tips["day"].unique())
print("Time Categories:",tips["time"].unique())
print("Party Size Categories:",tips["size"].unique())

Gender Categories: [Female, Male]
Categories (2, object): [Female, Male]
Smoker Categories: [No, Yes]
Categories (2, object): [No, Yes]
Day Categories: [Sun, Sat, Thur, Fri]
Categories (4, object): [Sun, Sat, Thur, Fri]
Time Categories: [Dinner, Lunch]
Categories (2, object): [Dinner, Lunch]
Party Size Categories: [2 3 4 1 6 5]


In [10]:
# Given that the tip is what we are really looking for it makes sense to include it in the dataframe
tips["tipPC"] = 100*tips["tip"]/tips["total_bill"]
# The describe function gives general statistics on the dataset. 
# Using the include = "all" means we also get the category data too
tips.describe(include="all")

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tipPC
count,244.0,244.0,244,244,244,244,244.0,244.0
unique,,,2,2,4,2,,
top,,,Male,No,Sat,Dinner,,
freq,,,157,151,87,176,,
mean,19.785943,2.998279,,,,,2.569672,16.080258
std,8.902412,1.383638,,,,,0.9511,6.10722
min,3.07,1.0,,,,,1.0,3.563814
25%,13.3475,2.0,,,,,2.0,12.912736
50%,17.795,2.9,,,,,2.0,15.476977
75%,24.1275,3.5625,,,,,3.0,19.147549


## <a name="info"></a>2. General Description of the Dataset  
The dataset itself has 244 rows, each with 7 attributes equating to 1708 data values in total. The recorded attributes were:
- _*total_bill*_ : Total Bill Amount assumed to be US\\$. This is a floating point value.
- _*tip*_ : Tip amount assumed to be US\\$. This is a floating point value.
- _*sex*_ : Gender of bill payer. This is categorical data with values of either male or female.
- _*smoker*_ : Whether they were a smoker or not. This is categorical data with values of either yes or no.
- _*day*_ : What day of the week the transaction occurred. This is categorical data with the values Thur, Fri, Sat and Sun.
- _*time*_ : Meal the diners were being served. This is categorical data with values of either lunch or dinner.
- _*size*_ : Number of people in the party. This is an integer value ranging from 1 to 6.  

The first, last and randomly sampled rows are intact and there are no missing values, which leads us to believe the dataset is of good quality.  
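
A few further sanity checks could back up this impression of data quality. The sketch below is only an illustration: `quality_report` is a hypothetical helper, not part of the notebook, and it runs on a stand-in sample made of the five rows shown by `tips.head()` above. Passing the full `tips` dataframe instead would check the whole dataset.

```python
import pandas as pd

# Stand-in sample: the five rows displayed by tips.head() earlier in the notebook.
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "size": [2, 3, 3, 2, 4],
})

def quality_report(df):
    """Return simple pass/fail sanity checks (hypothetical helper)."""
    return {
        "no_missing_values": not df.isnull().any().any(),
        "bills_are_positive": bool((df["total_bill"] > 0).all()),
        "tip_smaller_than_bill": bool((df["tip"] < df["total_bill"]).all()),
        "party_of_at_least_one": bool((df["size"] >= 1).all()),
    }

report = quality_report(sample)
for check, passed in report.items():
    print(f"{check}: {passed}")
```

These checks are assumptions about what "good quality" means here; they do not prove the data was recorded correctly, only that nothing is obviously impossible.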

Over the data collection period, the waiter served 627 customers in 244 transactions, generating an income of \\$4827.77 for the restaurant and \\$731.58 in tips.  
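
The totals above give one way of expressing the overall tip rate, and it is worth seeing how it differs from the mean of the per-bill `tipPC` column. The sketch below simply redoes the arithmetic with the figures already quoted in this notebook; the variable names are illustrative only.

```python
# Figures quoted in this notebook: sums from the overview cell, mean from describe().
total_bills = 4827.77
total_tips = 731.58
customers = 627
mean_row_tip_pct = 16.080258  # mean of the per-bill tipPC column

# Pooled rate: all tips divided by all bills.
pooled_tip_pct = 100 * total_tips / total_bills

# Per-person averages over the whole collection period.
bill_per_person = total_bills / customers
tip_per_person = total_tips / customers

print(f"Pooled tip rate:       {pooled_tip_pct:.2f}%")
print(f"Mean of per-bill rate: {mean_row_tip_pct:.2f}%")
print(f"Bill per person:       {bill_per_person:.2f}")
print(f"Tip per person:        {tip_per_person:.2f}")
```

The pooled rate weights every dollar equally, while the mean of `tipPC` weights every bill equally, so a likely explanation for the gap is that smaller bills tend to carry proportionally larger tips.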

Each individual total bill value ranged from \\$3.07 to \\$50.81 with the mean being \\$19.79. The tips ranged from \\$1 to \\$10 with the mean being \\$3.  

Males were the most frequent bill payers and non-smokers were more prevalent than smokers. Saturday was the most popular dining day with dinner being the most served meal in the dataset.
[>>>](#cat)

In [11]:
# Create dataframes based on the categories
dinner = tips.loc[tips["time"] == "Dinner"]
lunch = tips.loc[tips["time"] == "Lunch"]
smoker = tips.loc[tips["smoker"] == "Yes"]
nonsmoker = tips.loc[tips["smoker"] == "No"]
male = tips.loc[tips["sex"] == "Male"]
female = tips.loc[tips["sex"] == "Female"]
thurs = tips.loc[tips["day"] == "Thur"]
fri = tips.loc[tips["day"] == "Fri"]
sat = tips.loc[tips["day"] == "Sat"]
sun = tips.loc[tips["day"] == "Sun"]
size1 = tips.loc[tips["size"] == 1]
size2 = tips.loc[tips["size"] == 2]
size3 = tips.loc[tips["size"] == 3]
size4 = tips.loc[tips["size"] == 4]
size5 = tips.loc[tips["size"] == 5]
size6 = tips.loc[tips["size"] == 6]

In [35]:
# Create a table to compare data
dfs =[dinner, lunch, smoker, nonsmoker, male, female, thurs, fri, sat, sun, size1, size2, size3, size4, size5, size6]
cats = ["Meal", "Meal", "Smoking Status", "Smoking Status","Gender", "Gender", "Day", "Day", "Day", "Day", "Party Size",  "Party Size", "Party Size", "Party Size", "Party Size", "Party Size"]
cls = ["Dinner", "Lunch", "Smoker", "Nonsmoker", "Male", "Female", "Thurs", "Fri", "Sat", "Sun", "1", "2", "3", "4", "5", "6"]
totIncome = tips["total_bill"].sum()
totTips = tips["tip"].sum()
totPeople = tips["size"].sum()

table = []
for i in range(0,len(dfs)):
    row = [cats[i], cls[i], dfs[i]["total_bill"].count()]# The number of rows involved
    row.append((100*dfs[i]["total_bill"].count()/len(tips)).round(2)) # Transactions as % of total
    row.append(dfs[i]["size"].sum()) # Seeing the number of people this represents
    row.append((100*dfs[i]["size"].sum()/totPeople).round(2)) # # People as % of total
    row.append(dfs[i]["total_bill"].sum().round(2)) # Total bill for this group
    row.append((100*dfs[i]["total_bill"].sum()/totIncome).round(2)) # % of total income
    row.append((dfs[i]["total_bill"].sum()/dfs[i]["size"].sum()).round(2)) # Total bill per person for this group
    row.append(dfs[i]["tip"].sum().round(2)) # Total tips for this group
    row.append((100*dfs[i]["tip"].sum()/totTips).round(2)) # Tips as % of total tips
    row.append((dfs[i]["tip"].sum()/dfs[i]["size"].sum()).round(2))# Total tips per person for this group
    row.append(((dfs[i]["tip"].sum()/dfs[i]["total_bill"].sum())*100).round(2))# % Tips based on total bill and total tips received for this group
    table.append(row)

# Header row
hdr = ['Category', 'Group', 'Table Count', '% Tables', 'People', '% People', 'Total Bills ($)', '% of Total Bills', 'Bill pp ($)', 'Total Tips ($)', '% Total Tips', 'Tips pp ($)', 'Total Tip %']

summarydf = pd.DataFrame(table, columns = hdr)
summarydf

Unnamed: 0,Category,Group,Table Count,% Tables,People,% People,Total Bills ($),% of Total Bills,Bill pp ($),Total Tips ($),% Total Tips,Tips pp ($),Total Tip %
0,Meal,Dinner,176,72.13,463,73.84,3660.3,75.82,7.91,546.07,74.64,1.18,14.92
1,Meal,Lunch,68,27.87,164,26.16,1167.47,24.18,7.12,185.51,25.36,1.13,15.89
2,Smoking Status,Smoker,93,38.11,224,35.73,1930.34,39.98,8.62,279.81,38.25,1.25,14.5
3,Smoking Status,Nonsmoker,151,61.89,403,64.27,2897.43,60.02,7.19,451.77,61.75,1.12,15.59
4,Gender,Male,157,64.34,413,65.87,3256.82,67.46,7.89,485.07,66.3,1.17,14.89
5,Gender,Female,87,35.66,214,34.13,1570.95,32.54,7.34,246.51,33.7,1.15,15.69
6,Day,Thurs,62,25.41,152,24.24,1096.33,22.71,7.21,171.83,23.49,1.13,15.67
7,Day,Fri,19,7.79,40,6.38,325.88,6.75,8.15,51.96,7.1,1.3,15.94
8,Day,Sat,87,35.66,219,34.93,1778.4,36.84,8.12,260.4,35.59,1.19,14.64
9,Day,Sun,76,31.15,216,34.45,1627.16,33.7,7.53,247.39,33.82,1.15,15.2


## <a name="cat"></a> 3. Categorical Data
Looking at the categories as a whole we can make a number of important observations.  

### <a name = "meal"></a>3.1 Meal
We can see that 72.13% of all tables served were during a dinner sitting. This equated to 73.84% of the people served. The importance of dinner is further emphasised when you consider that 75.82% of all income was through dinners.
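
The same shares can be derived with a single `groupby` rather than one dataframe per group. The sketch below is one possible way to do it, not the method used for the table above: `category_summary` is a hypothetical helper, and the sample is a stand-in built from four rows displayed earlier in the notebook (two dinner, two lunch). Calling it on the full `tips` dataframe with `"time"`, `"smoker"`, `"sex"`, `"day"` or `"size"` should give the corresponding rows of the summary table.

```python
import pandas as pd

# Stand-in sample: rows 0, 1, 129 and 198 as displayed earlier in the notebook.
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 22.82, 13.00],
    "tip": [1.01, 1.66, 2.18, 2.00],
    "time": ["Dinner", "Dinner", "Lunch", "Lunch"],
    "size": [2, 3, 3, 2],
})

def category_summary(df, column):
    """Tables, people, bills and tips per level of `column` (hypothetical helper)."""
    grouped = df.groupby(column, observed=True).agg(
        tables=("total_bill", "count"),
        people=("size", "sum"),
        bills=("total_bill", "sum"),
        tips=("tip", "sum"),
    )
    summary = pd.DataFrame({
        "Table Count": grouped["tables"],
        "% Tables": 100 * grouped["tables"] / len(df),
        "People": grouped["people"],
        "% People": 100 * grouped["people"] / df["size"].sum(),
        "Total Bills ($)": grouped["bills"],
        "% of Total Bills": 100 * grouped["bills"] / df["total_bill"].sum(),
        "Total Tips ($)": grouped["tips"],
        # Pooled tip rate: the group's tips divided by the group's bills.
        "Total Tip %": 100 * grouped["tips"] / grouped["bills"],
    })
    return summary.round(2)

meal_summary = category_summary(sample, "time")
print(meal_summary)
```

The `observed=True` argument only matters for categorical columns such as those in the seaborn dataset, where it keeps unused category levels out of the result.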

From a tips perspective, dinner accounted for 74.64% of the total tips received. This however equated to a tip rate of only 14.92% of the dinner bills, compared with 15.89% for lunch.

**General Observation**  
Based on the rudimentary calculations above, it would appear the best tip as a % of total bill is to be found if the waiter serves lunch on a Friday to a non-smoking woman dining on her own. Mind you, this particular scenario accounted for exactly $0 in tips, so I guess the conclusion of this part would be: don't rely on generalities!  
The most tips, however, would be obtained by serving a party of 2 non-smokers having dinner on a Saturday where the bill is being paid by a male.

In [None]:
(tips.groupby(["day", "time", "sex", "smoker", "size"])["tip"].sum() 
   .sort_values(ascending=False) 
   .reset_index(name='Total Tips') ).head()

In [None]:
# Just checking which was the most common day and time
(tips.groupby(["day", "time"]).size() 
   .sort_values(ascending=False) 
   .reset_index(name='count') 
   )

In [None]:
# Just checking which was the most common scenario - Sunday dinner with two non-smoking males apparently
(tips.groupby(["day", "time", "sex", "smoker", "size"]).size() 
   .sort_values(ascending=False) 
   .reset_index(name='count') 
   ).head()

In [None]:
tips.describe()

In [None]:
tips.groupby("day").describe()

In [None]:
# Found this and thought it was worth trying ...
profile = pandas_profiling.ProfileReport(tips)
profile

This is a really interesting summary of the data - with lots of information, both numeric and graphical contained within it.

## Start Plotting ##
First off we will look at the variables individually.  
  
_Note: This is preliminary - I will almost certainly not include everything here in final submission._

In [None]:
sns.distplot(tips['total_bill'],kde=True,bins=30)

Looking at the histogram, the bulk of the bills fell in the \\$10 to \\$25 bracket.

In [None]:
sns.distplot(tips['tip'],kde=True,bins=30)

And the bulk of the tips were between \\$1 and \\$3.50, with more outliers than in the bill amounts. Unlike the bill amounts though, the outliers are more prevalent at the upper end of the range.
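
One way to put a number on "outlier" is Tukey's rule, which flags values more than 1.5 times the interquartile range beyond the quartiles. This rule is an assumption brought in for illustration and is not something the notebook itself applies. The sketch uses the quartiles reported by `describe()` above; `tukey_fences` is a hypothetical helper.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles taken from the describe() output earlier in the notebook.
tip_low, tip_high = tukey_fences(2.0, 3.5625)
bill_low, bill_high = tukey_fences(13.3475, 24.1275)

print(f"Tip fences:  {tip_low:.2f} to {tip_high:.2f}")
print(f"Bill fences: {bill_low:.2f} to {bill_high:.2f}")
```

Both lower fences come out negative, so under this rule neither variable can have low outliers and any outliers must lie at the upper end.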

In [None]:
sns.distplot(tips['size'],kde=True,bins=30)

By far the most prevalent party size was 2 people.

In [None]:
sns.distplot(tips["tipPC"],kde=True,bins=30)

As we can see from the plot above, most of the tips (as % of the total bill) fall between 10% and 20%, peaking at around 15%.
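
A reading taken from a plot like this can be checked numerically by counting the share of values inside the range. The sketch below is illustrative only: `share_between` is a hypothetical helper, and the sample is a stand-in made of the five rows shown by `tips.head()`, so the printed share describes those five rows and not the full dataset.

```python
import pandas as pd

# Stand-in sample: the five rows displayed by tips.head() earlier in the notebook.
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
})
sample["tipPC"] = 100 * sample["tip"] / sample["total_bill"]

def share_between(series, low, high):
    """Fraction of values in the closed interval [low, high] (hypothetical helper)."""
    return series.between(low, high).mean()

share = share_between(sample["tipPC"], 10, 20)
print(f"Share of sample tips between 10% and 20%: {share:.0%}")
```

Running `share_between(tips["tipPC"], 10, 20)` on the full dataframe would give the figure that corresponds to the plot.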

In [None]:
plt.hist((tips['tip']+tips["total_bill"])%100, 50)

In [None]:
sns.boxplot(x='day',y='total_bill',data=tips,palette='rainbow')

In [None]:
sns.boxplot(x="day", y="total_bill", hue="sex", data=tips, palette="PRGn")
sns.despine(offset=10, trim=True)

In [None]:
# Plotting total bill and tip against day with a pairplot
# (recent seaborn versions use `height` where older ones used `size`)
#sns.pairplot(data=tips, hue='sex', palette='icefire', x_vars=['total_bill','size'], y_vars=['tip'], height=6, aspect=.85, kind='reg')
sns.pairplot(data=tips, x_vars=['day'], y_vars=['total_bill','tip'], height=6, aspect=.85, kind='scatter')

## Plotting the Categorical Data ##

In [None]:
sns.catplot(x="day", y="total_bill", hue="sex",
            kind="violin", inner="stick", split=True,
            palette="pastel", data=tips);

In [None]:
sns.catplot(x="day", y="total_bill", hue="smoker",
            kind="violin", inner="stick", split=True,
            palette="pastel", data=tips);

In [None]:
sns.catplot(x="day", y="total_bill", hue="time",
            kind="violin", inner="stick", split=True,
            palette="pastel", data=tips);

## Bi-Variate Plotting ##

In [None]:
sns.pairplot(tips)

In [None]:
sns.pairplot(tips, hue = "time")

In [None]:
sns.pairplot(tips, hue = "sex")

In [None]:
sns.pairplot(tips, hue = "smoker")

In [None]:
sns.pairplot(tips, hue = "day")

In [None]:
sns.pairplot(tips, hue = "size")

In [None]:
# Get the regression line using all the data.
# linregress returns the correlation coefficient r, so square it to get r^2.
t_slope, t_intercept, t_r, t_p, t_stdErr = stats.linregress(tips["total_bill"], tips["tip"])
t_r2 = t_r**2

In [None]:
# See how good a fit it is
t_r2
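
A note on reading this value: the third item returned by `scipy.stats.linregress` is the correlation coefficient r, and the coefficient of determination is its square. The sketch below illustrates the relationship on a stand-in sample of ten rows displayed earlier in the notebook (the `head()` and `tail()` rows), cross-checking r squared against the residual-based definition.

```python
import numpy as np
from scipy import stats

# Stand-in sample: the head() and tail() rows displayed earlier in the notebook.
bill = np.array([16.99, 10.34, 21.01, 23.68, 24.59, 29.03, 27.18, 22.67, 17.82, 18.78])
tip = np.array([1.01, 1.66, 3.50, 3.31, 3.61, 5.92, 2.00, 2.00, 1.75, 3.00])

fit = stats.linregress(bill, tip)

# linregress reports r; squaring it gives the coefficient of determination.
r_squared = fit.rvalue ** 2

# Cross-check with the definition 1 - SS_res / SS_tot.
predicted = fit.intercept + fit.slope * bill
ss_res = np.sum((tip - predicted) ** 2)
ss_tot = np.sum((tip - tip.mean()) ** 2)
r_squared_check = 1 - ss_res / ss_tot

print(f"r = {fit.rvalue:.3f}, r^2 = {r_squared:.3f}, check = {r_squared_check:.3f}")
```

The two figures agree for a straight-line fit with an intercept, which is what `linregress` computes.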

In [None]:
# Plot both the raw data and the "best fit" line
plt.plot(tips["total_bill"], tips["tip"], 'o', label='original data')
plt.plot(tips["total_bill"], t_intercept + t_slope*tips["total_bill"], 'r', label='fitted line')
plt.legend()
plt.show()

In [None]:
# Another view on the data 
sns.lmplot(x="size", y="tipPC", data=tips)

In [None]:
# Time of day comparison
# Get the regression lines for the dinner and lunch subsets.
# linregress returns the correlation coefficient r, so square it to get r^2.
d_slope, d_intercept, d_r, d_p, d_stdErr = stats.linregress(dinner["total_bill"], dinner["tip"])
l_slope, l_intercept, l_r, l_p, l_stdErr = stats.linregress(lunch["total_bill"], lunch["tip"])
d_r2, l_r2 = d_r**2, l_r**2
print("r^2 dinner (Dataset size", len(dinner.index), "rows) = ", d_r2)
print("r^2 lunch (Dataset size", len(lunch.index), "rows) = ", l_r2)

In [None]:
# Check if the time values are statistically different
print("T-Test Results")
print("Total Bill results:", stats.ttest_ind(dinner['total_bill'], lunch['total_bill']))
print("Tip results:", stats.ttest_ind(dinner['tip'], lunch['tip']))
print("Party Size results:", stats.ttest_ind(dinner['size'], lunch['size']))
print("% Tip of Total Bill results:", stats.ttest_ind(dinner['tipPC'], lunch['tipPC']))

In [None]:
plt.rcParams['figure.figsize'] = [16, 6]

# Plot both the raw data and the "best fit" lines
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)

ax1.plot(dinner["total_bill"], dinner["tip"], 'ko', label='original dinner data')
ax1.plot(dinner["total_bill"], d_intercept + d_slope*dinner["total_bill"], 'r', label='fitted dinner line')
ax1.legend()
ax2.plot(lunch["total_bill"], lunch["tip"], 'go', label='original lunch data')
ax2.plot(lunch["total_bill"], l_intercept + l_slope*lunch["total_bill"], 'b', label='fitted lunch line')
ax2.legend()

# Set labels
ax1.set_xlabel('Total Bill Amount ($)')
ax2.set_xlabel('Total Bill Amount ($)')
ax1.set_ylabel('Tip Amount ($)')
fig.suptitle('Time of Day Comparison', fontsize=18)
ax1.set_title('Dinner',fontsize=14)
ax2.set_title('Lunch',fontsize=14)
plt.show()

In [None]:
# Smoker comparison
# Get the regression lines for the smoker and non-smoker subsets.
# linregress returns the correlation coefficient r, so square it to get r^2.
s_slope, s_intercept, s_r, s_p, s_stdErr = stats.linregress(smoker["total_bill"], smoker["tip"])
ns_slope, ns_intercept, ns_r, ns_p, ns_stdErr = stats.linregress(nonsmoker["total_bill"], nonsmoker["tip"])
s_r2, ns_r2 = s_r**2, ns_r**2
print("r^2 smokers (Dataset size", len(smoker.index), "rows) = ", s_r2)
print("r^2 non-smokers (Dataset size", len(nonsmoker.index), "rows) = ", ns_r2)

In [None]:
# Check if the smoking status values are statistically different
print("T-Test Results")
print("Total Bill results:", stats.ttest_ind(smoker['total_bill'], nonsmoker['total_bill']))
print("Tip results:", stats.ttest_ind(smoker['tip'], nonsmoker['tip']))
print("Party Size results:", stats.ttest_ind(smoker['size'], nonsmoker['size']))
print("% Tip of Total Bill results:", stats.ttest_ind(smoker['tipPC'], nonsmoker['tipPC']))

In [None]:
plt.rcParams['figure.figsize'] = [16, 6]
#fig=plt.figure(figsize=(5, 10),  facecolor='w', edgecolor='k')
# Plot both the raw data and the "best fit" lines
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)

ax1.plot(smoker["total_bill"], smoker["tip"], 'ko', label='original smoker data')
ax1.plot(smoker["total_bill"], s_intercept + s_slope*smoker["total_bill"], 'r', label='fitted smoker line')
ax1.legend()
ax2.plot(nonsmoker["total_bill"], nonsmoker["tip"], 'go', label='original non-smoker data')
ax2.plot(nonsmoker["total_bill"], ns_intercept + ns_slope*nonsmoker["total_bill"], 'b', label='fitted non-smoker line')
ax2.legend()

# Set labels
ax1.set_xlabel('Total Bill Amount ($)')
ax2.set_xlabel('Total Bill Amount ($)')
ax1.set_ylabel('Tip Amount ($)')
fig.suptitle('Smoker / Non-Smoker Comparison', fontsize=18)
ax1.set_title('Smoker',fontsize=14)
ax2.set_title('Non-Smoker',fontsize=14)
plt.show()

In [None]:
# Gender comparison
# Get the regression lines for the male and female subsets.
# linregress returns the correlation coefficient r, so square it to get r^2.
m_slope, m_intercept, m_r, m_p, m_stdErr = stats.linregress(male["total_bill"], male["tip"])
f_slope, f_intercept, f_r, f_p, f_stdErr = stats.linregress(female["total_bill"], female["tip"])
m_r2, f_r2 = m_r**2, f_r**2
print("r^2 male (Dataset size", len(male.index), "rows) = ", m_r2)
print("r^2 female (Dataset size", len(female.index), "rows) = ", f_r2)

In [None]:
plt.rcParams['figure.figsize'] = [16, 6]
#fig=plt.figure(figsize=(5, 10),  facecolor='w', edgecolor='k')
# Plot both the raw data and the "best fit" lines
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)

ax1.plot(male["total_bill"], male["tip"], 'ko', label='original male data')
ax1.plot(male["total_bill"], m_intercept + m_slope*male["total_bill"], 'r', label='fitted male line')
ax1.legend()
ax2.plot(female["total_bill"], female["tip"], 'go', label='original female data')
ax2.plot(female["total_bill"], f_intercept + f_slope*female["total_bill"], 'b', label='fitted female line')
ax2.legend()

# Set labels
ax1.set_xlabel('Total Bill Amount ($)')
ax2.set_xlabel('Total Bill Amount ($)')
ax1.set_ylabel('Tip Amount ($)')
fig.suptitle('Gender Comparison', fontsize=18)
ax1.set_title('Male',fontsize=14)
ax2.set_title('Female',fontsize=14)
plt.show()

In [None]:
# Check if the gender values are statistically different
print("T-Test Results")
print("Total Bill results:", stats.ttest_ind(male['total_bill'], female['total_bill']))
print("Tip results:", stats.ttest_ind(male['tip'], female['tip']))
print("Party Size results:", stats.ttest_ind(male['size'], female['size']))
print("% Tip of Total Bill results:", stats.ttest_ind(male['tipPC'], female['tipPC']))
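
The t-tests in this notebook use the default settings of `stats.ttest_ind`, which assume the two groups have equal variances. As the groups here differ in size, Welch's variant (`equal_var=False`) may be a safer comparison. The sketch below runs both on a stand-in sample, namely the tips from the fifteen rows displayed earlier in the notebook split by the sex of the bill payer, so the printed values describe those rows only.

```python
from scipy import stats

# Stand-in sample: tips from the head(), tail() and sample() rows shown earlier.
male_tips = [1.66, 3.50, 3.31, 5.92, 2.00, 1.75, 4.00, 2.34, 2.18]
female_tips = [1.01, 3.61, 2.00, 3.00, 3.00, 2.00]

# Student's t-test: assumes both groups share the same variance (the default).
student = stats.ttest_ind(male_tips, female_tips)

# Welch's t-test: drops the equal-variance assumption.
welch = stats.ttest_ind(male_tips, female_tips, equal_var=False)

print(f"Student: t = {student.statistic:.3f}, p = {student.pvalue:.3f}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.3f}")
```

When the two versions give similar p-values, the equal-variance assumption is not driving the conclusion.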

In [None]:
# Party Size comparison
# Get the regression line for each party size.
# linregress returns the correlation coefficient r, so r^2 is computed below.
s1_slope, s1_intercept, s1_r, s1_p, s1_stdErr = stats.linregress(size1["total_bill"], size1["tip"])
s2_slope, s2_intercept, s2_r, s2_p, s2_stdErr = stats.linregress(size2["total_bill"], size2["tip"])
s3_slope, s3_intercept, s3_r, s3_p, s3_stdErr = stats.linregress(size3["total_bill"], size3["tip"])
s4_slope, s4_intercept, s4_r, s4_p, s4_stdErr = stats.linregress(size4["total_bill"], size4["tip"])
s5_slope, s5_intercept, s5_r, s5_p, s5_stdErr = stats.linregress(size5["total_bill"], size5["tip"])
s6_slope, s6_intercept, s6_r, s6_p, s6_stdErr = stats.linregress(size6["total_bill"], size6["tip"])
s1_r2, s2_r2, s3_r2 = s1_r**2, s2_r**2, s3_r**2
s4_r2, s5_r2, s6_r2 = s4_r**2, s5_r**2, s6_r**2
print("r^2 Party Size 1 (Dataset size", len(size1.index), "rows) = ", s1_r2)
print("r^2 Party Size 2 (Dataset size", len(size2.index), "rows) = ", s2_r2)
print("r^2 Party Size 3 (Dataset size", len(size3.index), "rows) = ", s3_r2)
print("r^2 Party Size 4 (Dataset size", len(size4.index), "rows) = ", s4_r2)
print("r^2 Party Size 5 (Dataset size", len(size5.index), "rows) = ", s5_r2)
print("r^2 Party Size 6 (Dataset size", len(size6.index), "rows) = ", s6_r2)

In [None]:
plt.rcParams['figure.figsize'] = [16,16]

# Plot both the raw data and the "best fit" lines
fig, ((ax1, ax2), (ax3, ax4), ( ax5, ax6)) = plt.subplots(3, 2, sharey=True)

ax1.plot(size1["total_bill"], size1["tip"], 'ko', label='original 1 person data')
ax1.plot(size1["total_bill"], s1_intercept + s1_slope*size1["total_bill"], 'r', label='fitted 1 person line')
ax1.legend()
ax2.plot(size2["total_bill"], size2["tip"], 'go', label='original 2 people data')
ax2.plot(size2["total_bill"], s2_intercept + s2_slope*size2["total_bill"], 'b', label='fitted 2 people line')
ax2.legend()
ax3.plot(size3["total_bill"], size3["tip"], 'ko', label='original 3 people data')
ax3.plot(size3["total_bill"], s3_intercept + s3_slope*size3["total_bill"], 'r', label='fitted 3 people line')
ax3.legend()
ax4.plot(size4["total_bill"], size4["tip"], 'go', label='original 4 people data')
ax4.plot(size4["total_bill"], s4_intercept + s4_slope*size4["total_bill"], 'b', label='fitted 4 people line')
ax4.legend()
ax5.plot(size5["total_bill"], size5["tip"], 'ko', label='original 5 people data')
ax5.plot(size5["total_bill"], s5_intercept + s5_slope*size5["total_bill"], 'r', label='fitted 5 people line')
ax5.legend()
ax6.plot(size6["total_bill"], size6["tip"], 'go', label='original 6 people data')
ax6.plot(size6["total_bill"], s6_intercept + s6_slope*size6["total_bill"], 'b', label='fitted 6 people line')
ax6.legend()

# Set labels
ax1.set_xlabel('Total Bill Amount ($)')
ax2.set_xlabel('Total Bill Amount ($)')
ax3.set_xlabel('Total Bill Amount ($)')
ax4.set_xlabel('Total Bill Amount ($)')
ax5.set_xlabel('Total Bill Amount ($)')
ax6.set_xlabel('Total Bill Amount ($)')
ax1.set_ylabel('Tip Amount ($)')
ax3.set_ylabel('Tip Amount ($)')
ax5.set_ylabel('Tip Amount ($)')
fig.suptitle('Size of Party Comparison', fontsize=18)
ax1.set_title('1 Person',fontsize=14)
ax2.set_title('2 People',fontsize=14)
ax3.set_title('3 People',fontsize=14)
ax4.set_title('4 People',fontsize=14)
ax5.set_title('5 People',fontsize=14)
ax6.set_title('6 People',fontsize=14)
plt.show()

In [None]:
# Check if the party size values are statistically different
print("ANOVA Results")
print("Total Bill Results", stats.f_oneway(size1['total_bill'], size2['total_bill'], size3['total_bill'], size4['total_bill'], size5['total_bill'], size6['total_bill']))
print("Tip Results", stats.f_oneway(size1['tip'], size2['tip'], size3['tip'], size4['tip'], size5['tip'], size6['tip']))
print("% Tip of Total Bill Results", stats.f_oneway(size1['tipPC'], size2['tipPC'], size3['tipPC'], size4['tipPC'], size5['tipPC'], size6['tipPC']))

In [None]:
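# jointplot is a figure-level function and cannot draw onto existing axes,
# so the axes-level regplot is used to place the two regressions side by side
fig, (ax1, ax2) = plt.subplots(ncols=2)
sns.regplot(x = 'size', y = 'total_bill', data = tips, ax = ax1)
sns.regplot(x = 'size', y = 'tip', data = tips, ax = ax2)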

In [None]:
table = [["Smokers", len(smoker.index), s_slope, s_intercept, s_r2, s_p, s_stdErr], \
         ["Non-Smokers", len(nonsmoker.index), ns_slope, ns_intercept, ns_r2, ns_p, ns_stdErr], \
         ["Male", len(male.index), m_slope, m_intercept, m_r2, m_p, m_stdErr], \
         ["Female", len(female.index), f_slope, f_intercept, f_r2, f_p, f_stdErr], \
         ["Lunch", len(lunch.index), l_slope, l_intercept, l_r2, l_p, l_stdErr], \
         ["Dinner", len(dinner.index), d_slope, d_intercept, d_r2, d_p, d_stdErr], \
         ["Party of 1", len(size1.index), s1_slope, s1_intercept, s1_r2, s1_p, s1_stdErr], \
         ["Party of 2", len(size2.index), s2_slope, s2_intercept, s2_r2, s2_p, s2_stdErr], \
         ["Party of 3", len(size3.index), s3_slope, s3_intercept, s3_r2, s3_p, s3_stdErr], \
         ["Party of 4", len(size4.index), s4_slope, s4_intercept, s4_r2, s4_p, s4_stdErr], \
         ["Party of 5", len(size5.index), s5_slope, s5_intercept, s5_r2, s5_p, s5_stdErr], \
         ["Party of 6", len(size6.index), s6_slope, s6_intercept, s6_r2, s6_p, s6_stdErr]]
hdr = ["Category", "Sample Size", "Slope", "Intercept", "R^2 Value", "P-Value", "Std Error"]
print(tabulate(table, headers = hdr,  tablefmt="grid"))

In [None]:
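# Size comparison
# Get the regression lines of party size against total bill and against tip.
# linregress returns the correlation coefficient r, so square it to get r^2.
bs_slope, bs_intercept, bs_r, bs_p, bs_stdErr = stats.linregress(tips["total_bill"], tips["size"])
ts_slope, ts_intercept, ts_r, ts_p, ts_stdErr = stats.linregress(tips["tip"], tips["size"])
bs_r2, ts_r2 = bs_r**2, ts_r**2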
print("r^2 Size vs Total Bill = ", bs_r2)
print("r^2 Size vs Tip = ", ts_r2)

In [None]:
plt.rcParams['figure.figsize'] = [16, 6]
#fig=plt.figure(figsize=(5, 10),  facecolor='w', edgecolor='k')
# Plot both the raw data and the "best fit" lines
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)

ax1.plot(tips["total_bill"], tips["size"], 'ko', label='original size / bill data')
ax1.plot(tips["total_bill"], bs_intercept + bs_slope*tips["total_bill"], 'r', label='fitted size / bill line')
ax1.legend()
ax2.plot(tips["tip"], tips["size"], 'go', label='original size / tip data')
ax2.plot(tips["tip"], ts_intercept + ts_slope*tips["tip"], 'b', label='fitted size / tip line')
ax2.legend()

# Set labels
ax1.set_xlabel('Total Bill Amount ($)')
ax2.set_xlabel('Tip Amount ($)')
ax1.set_ylabel('Party Size')
fig.suptitle('Influence of Party Size', fontsize=18)
ax1.set_title('Total Bill',fontsize=14)
ax2.set_title('Tip',fontsize=14)
plt.show()

Looking at how the party size influences both the total bill and the tip, it is clear that the tip amount is not as well correlated with the party size as the total bill amount is.
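
The two correlations can also be read side by side from a correlation matrix. The sketch below shows the idea with `DataFrame.corr()` on a stand-in sample of the fifteen rows displayed earlier in the notebook; with so few rows the values will not match those of the full dataset, so substituting `tips` is needed for the real comparison.

```python
import pandas as pd

# Stand-in sample: the head(), tail() and sample() rows shown earlier in the notebook.
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 29.03, 27.18, 22.67,
                   17.82, 18.78, 27.28, 17.81, 22.82, 17.07, 13.00],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 5.92, 2.00, 2.00,
            1.75, 3.00, 4.00, 2.34, 2.18, 3.00, 2.00],
    "size": [2, 3, 3, 2, 4, 3, 2, 2, 2, 2, 2, 4, 3, 3, 2],
})

# Pearson correlation between every pair of numeric columns.
corr = sample[["total_bill", "tip", "size"]].corr()

# Keep only the correlations of party size with the other two variables.
size_corr = corr["size"].drop("size")
print(size_corr.round(3))
```

Squaring these values gives the r squared figures for the corresponding straight-line fits.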

In [None]:
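# Size comparison
# Get the regression line of party size against % tip.
# linregress returns the correlation coefficient r, so square it to get r^2.
rs_slope, rs_intercept, rs_r, rs_p, rs_stdErr = stats.linregress(tips["tipPC"], tips["size"])
rs_r2 = rs_r**2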
print("r^2 Size vs % Tip = ", rs_r2)

plt.rcParams['figure.figsize'] = [16, 6]
#fig=plt.figure(figsize=(5, 10),  facecolor='w', edgecolor='k')
# Plot both the raw data and the "best fit" lines

plt.plot(tips["tipPC"], tips["size"], 'ko', label='original size / % data')
plt.plot(tips["tipPC"], rs_intercept + rs_slope*tips["tipPC"], 'r', label='fitted size / % line')
plt.legend()

# Set labels
plt.xlabel('% Tip of Total Bill')
plt.ylabel('Party Size')
plt.title('Influence of Party Size',fontsize=14)
plt.show()

In [None]:
# Divide the dataset into inputs and outputs
# Inputs = the data we know
# Output = what we are looking for
inputs_tb = tips[['total_bill']]
outputs_tb = tips['tip']
# Build the neural network
model_tb = Sequential()
model_tb.add(Dense(25, input_dim=inputs_tb.shape[1], activation='relu')) # Hidden 1
model_tb.add(Dense(10, activation='relu')) # Hidden 2
model_tb.add(Dense(1)) # Output
model_tb.compile(loss='mean_squared_error', optimizer='adam')
model_tb.fit(inputs_tb,outputs_tb,verbose=2,epochs=100)

In [None]:
pred_tb = model_tb.predict(inputs_tb)
print("Shape: {}".format(pred_tb.shape))
print(pred_tb)

In [None]:
# Measure RMSE error.  RMSE is common for regression.
score_tb = np.sqrt(metrics.mean_squared_error(pred_tb,outputs_tb))
print(f"Final score (RMSE): {score_tb}")
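
The RMSE above is measured on the same rows the model was trained on, which tends to flatter the model. A common alternative is to hold back part of the data for testing. The sketch below shows the idea with scikit-learn; to keep it light it swaps in a plain `LinearRegression` for the Keras network and uses a stand-in sample of the fifteen rows displayed earlier in the notebook, so both the model and the printed scores are illustrative assumptions and not the notebook's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Stand-in sample: the head(), tail() and sample() rows shown earlier in the notebook.
X = np.array([16.99, 10.34, 21.01, 23.68, 24.59, 29.03, 27.18, 22.67,
              17.82, 18.78, 27.28, 17.81, 22.82, 17.07, 13.00]).reshape(-1, 1)
y = np.array([1.01, 1.66, 3.50, 3.31, 3.61, 5.92, 2.00, 2.00,
              1.75, 3.00, 4.00, 2.34, 2.18, 3.00, 2.00])

# Hold back 5 of the 15 rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=5, random_state=0)

# Stand-in model: ordinary least squares in place of the neural network.
model = LinearRegression().fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

print(f"Train RMSE: {train_rmse:.3f}")
print(f"Test RMSE:  {test_rmse:.3f}")
```

The same split could be passed to the Keras model's `fit` and `predict` calls; the test RMSE is the fairer guide to how the model would do on unseen bills.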

In [None]:
# Sample predictions
for i in range(10):
    print(f"{i+1}. Total Bill: {inputs_tb['total_bill'][i]}, Actual Tip: {outputs_tb[i]}, predicted tip: {pred_tb[i]}")

In [None]:
# Divide the dataset into inputs and outputs
# Inputs = the data we know
# Output = what we are looking for
inputs_tbs = tips[['total_bill', 'size']]
outputs_tbs = tips['tip']
# Build the neural network
model_tbs = Sequential()
model_tbs.add(Dense(25, input_dim=inputs_tbs.shape[1], activation='relu')) # Hidden 1
model_tbs.add(Dense(10, activation='relu')) # Hidden 2
model_tbs.add(Dense(1)) # Output
model_tbs.compile(loss='mean_squared_error', optimizer='adam')
model_tbs.fit(inputs_tbs,outputs_tbs,verbose=2,epochs=100)

In [None]:
pred_tbs = model_tbs.predict(inputs_tbs)
print("Shape: {}".format(pred_tbs.shape))
print(pred_tbs)

In [None]:
# Measure RMSE error.  RMSE is common for regression.
score_tbs = np.sqrt(metrics.mean_squared_error(pred_tbs,outputs_tbs))
print(f"Final score (RMSE): {score_tbs}")

In [None]:
# Sample predictions
for i in range(10):
    print(f"{i+1}. Total Bill: {inputs_tbs['total_bill'][i]}, Size: {inputs_tbs['size'][i]} Actual Tip: {outputs_tbs[i]}, predicted tip: {pred_tbs[i]}")

In [None]:
# Sample predictions
for i in range(10):
    print(f"{i+1}. Total Bill: {inputs_tbs['total_bill'][i]}, Size: {inputs_tbs['size'][i]} Actual Tip: {outputs_tbs[i]}, predicted tip: {pred_tb[i]}, predicted tip(Size): {pred_tbs[i]}")

In [None]:
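# Plot the actual tips as points and sort by total bill so the prediction line runs left to right
order = np.argsort(inputs_tb['total_bill'].values)
plt.plot(inputs_tb['total_bill'], outputs_tb, "g.")
plt.plot(inputs_tb['total_bill'].values[order], pred_tb[order], "r-")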


In [None]:
# Residuals (actual tip minus predicted tip) for both models
var_tbs = outputs_tbs.values - pred_tbs.flatten()
var_tb = outputs_tb.values - pred_tb.flatten()
plt.plot(var_tbs, "g-", label="total bill & size")
plt.plot(var_tb, "r-", label="total bill only")
plt.legend()


In [None]:
table = [["Mean", np.mean(var_tb), np.mean(var_tbs)],
         ["Minimum", np.min(var_tb), np.min(var_tbs)],
         ["Maximum", np.max(var_tb), np.max(var_tbs)],
         ["Standard Deviation", np.std(var_tb), np.std(var_tbs)]]
hdr = ["Category", "Total Bill Only", "Total Bill & Size"]
print(tabulate(table, headers = hdr,  tablefmt="grid"))

Data is most meaningful when it is collected for a purpose. In that way you can state the conditions under which it is collected, what decisional information is to be garnered from the data etc. The data in this dataset is not gathered under such conditions and therefore any analysis will be tempered by the lack of understanding of its collection. 

The analysis itself may be taken from a number of different perspectives and tailored to suit the requirements. In the absence of such governing principles, it has been decided to see what we can infer from the existing data based on a number of assumptions. These assumptions will be stated from each perspective below. To start we will begin with the general analysis.

## General Analysis
**Assumptions:** None  


### References ###
[1] [Seaborn: Tips Dataset](https://github.com/mwaskom/seaborn-data/blob/master/tips.csv)  
[2] [rdrr.io: Regression Class - Tips](https://rdrr.io/cran/regclass/man/TIPS.html)  
[3] [Scipy Docs: scipy.stats.linregress](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html)  
[4] [Jeff Heaton Introduction to Tensorflow](https://www.youtube.com/watch?v=PsE73jk55cE&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN)  
[5] [Real Python: Linear Regression in Python](https://realpython.com/linear-regression-in-python/)  
[6] [Stack Overflow: Grouped Boxplot with Seaborn](https://stackoverflow.com/questions/39344167/grouped-boxplot-with-seaborn)  
[7] [Seaborn Tutorial: Regression](https://seaborn.pydata.org/tutorial/regression.html)  
[8] [Towards Data Science: Data Visualisation](https://towardsdatascience.com/data-visualization-a6dccf643fbb)  