# Fundamentals of Data Analysis Project

<img src = "https://pandas.pydata.org/_static/pandas_logo.png" style="height:100px" align="right"/>
<img src = "https://seaborn.pydata.org/_static/scatterplot_matrix_thumb.png?v=0.9.0" style="width:150px" align="left"/>
<img src ="https://upload.wikimedia.org/wikipedia/commons/3/38/Jupyter_logo.svg" width="150" align="center"/>

<a id="overview"></a>
# Project Overview

This project concerns the well-known tips dataset and the Python packages **seaborn** and **jupyter**. 
The project is broken into three parts, as follows.

1. Description: Descriptive Statistics and plots to describe the tips dataset.
This section provides a summary of the tips dataset using summary statistics and plots.

2. Regression: Is there a relationship between the total bill and tip amount?
This section discusses and analyses the relationship, if any, between the total bill amount and the tip, together with an explanation of the analysis.

3. Analyse: Look at relationship between the variables within the dataset.
Where section 2 looks at the relationship between total bill amount and the tip amount, this section investigates what relationships exist between all of the variables, with interesting relationships highlighted and discussed.
***

<a id="toc"></a>

# Table of contents
- [Project Overview](#overview)   
- [About this notebook](#notebook)
    - [Project plan](#plan)
    - [Jupyter Notebook](#notebook)
    - [Python Libraries](#libraries)
    - [Downloading and running the code](#running)
- [Part 1: Descriptive Statistics and plots to describe the tips dataset.](#part1)
    - [The tips dataset](#tipsdataset)  
    - [Loading / Reading in the dataset](#loading)    
    - [Exploring the dataset](#exploring)  
    - [Summary Statistics](#statistics)  
    - [Visualising the dataset using plots](#visualising)
- [Part 2 Regression: Discuss and analyse whether there is a relationship between the total bill and tip amount.](#part2)
- [Part 3 Analyse: Analyse the relationships between the variables within the dataset](#part3)
- [References](#references)  

***

# About this notebook

***
## project plan - remove later!
<a id="plan"></a>

### Project Instructions
As per the attached [Project Instructions](Instructions.pdf), this assessment concerns the tips dataset and the Python packages seaborn and jupyter. The project is broken into three parts, as follows.

(30%) **Description**: Create a git repository and make it available online for the lecturer to clone. The repository should contain all your work for this assessment. Within the repository, create a jupyter notebook that uses descriptive statistics and plots to describe the tips dataset. 
*marked based on Good summary of the dataset, repository well laid-out and organised. Reasonable commits to the repository.*

(30%) **Regression**: To the above jupyter notebook add a section that discusses and analyses whether there is a relationship between the total bill and tip amount.
*marked based on Good analysis of the relationship between total bill and tip, with good explanations of the analysis.*

(40%) **Analyse**: Again using the same notebook, analyse the relationship between the variables within the dataset. You are free to interpret this as you wish; for example, you may analyse all pairs of variables, or select a subset and analyse those. 
*marked based on Reasonable work investigating the relationship between the variables, with interesting relationships highlighted and discussed.*

### 1. Description.
- Descriptive Statistics using pandas functions such as `describe`
- seaborn plots to show boxplots which visualise the main statistics from the `describe` function such as mean, median, lower and upper quartiles and interquartile ranges. 
- look at other seaborn plots that summarise the dataset

### 2. Regression
- look at plots first to see what the trends are between total bill amount and tip amount.
- **discuss** and **analyse** whether there is a relationship.
- provide a **good analysis** of this relationship backed up by good explanations of the analysis.

### 3. Analyse the relationships between the variables within the dataset
- should see from previous steps any obvious relationships and explore these in more detail.
- highlight interesting relationships between variables and discuss these in more detail. 

### 4. References
- Keep the reference list up to date as I go through the project. Look into a better way of referencing than adding links throughout the document, such as the reference-style links described in the [Markdown Guide](https://www.markdownguide.org/basic-syntax/#reference-style-links), which is not how the links are written at present.

### 5. Set up the relevant sections or parts before going any further!

### 6. Refer to project instructions pdf to ensure I am keeping on track and not going off on a tangent.

summarised above.

***

***
## About this notebook and python libraries used in it.
<a id="notebook"></a>

For this project I will be using the **NumPy**, **pandas**, **seaborn** and **matplotlib.pyplot** packages, which are imported using the conventional aliases of **np** for **NumPy**, **pd** for **pandas**, **sns** for **seaborn** and **plt** for **matplotlib.pyplot**.

[Seaborn](https://seaborn.pydata.org) is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. The [seaborn introduction](https://seaborn.pydata.org/introduction.html#introduction) describes it as follows:

> Seaborn is a library for making statistical graphics in Python. It is built on top of matplotlib and closely integrated with pandas data structures.


### Importing Python Libraries

In [1]:
## import libraries

# import libraries using common alias names
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

#np.version.version  # check what version of packages are installed.
print("NumPy version",np.__version__, "pandas version ",pd.__version__, "seaborn version",sns.__version__  )  # '1.16.2'

#np.set_printoptions(formatter={'float': lambda x: "{0:6.3f}".format(x)})
np.set_printoptions(precision=4)  # set floating point precision to 4
np.set_printoptions(threshold=5) # summarise long arrays
np.set_printoptions(suppress=True) # to suppress small results

pd.options.display.max_rows=8 # set options to display max number of rows

#from IPython.core.interactiveshell import InteractiveShell
#InteractiveShell.ast_node_interactivity = "all"  #

NumPy version 1.16.2 pandas version  0.24.2 seaborn version 0.9.0


In [2]:
# check what version of numpy and other packages I have installed.
print(pd.__version__)
print(sns.__version__)

0.24.2
0.9.0


In [None]:

# To display all the output in each cell instead of just the statement, run these two lines
#from IPython.core.interactiveshell import InteractiveShell
#InteractiveShell.ast_node_interactivity = "all"  # 

In [None]:
# checking what files I have in this folder
!ls
!ls *.png
!ls *.csv

***
# Part 1: Description: Descriptive Statistics and plots to describe the tips dataset. 

The goal for part 1 is to provide a good summary of the tips dataset using statistics and plots.
I will start by reading in the Tips dataset from the csv file and looking at the resulting pandas DataFrame object. First a little overview of the Tips dataset and where it came from. I may refer to the dataset as 'Tips' or 'tips' throughout this document. 


<a id="tipsdataset"></a>
## The Tips dataset

The tips dataset is available in the [seaborn-data repository](https://github.com/mwaskom/seaborn-data) belonging to Michael Waskom - the creator of the [seaborn](https://seaborn.pydata.org/index.html) python data visualisation package. 
According to its README document, this repository exists only as a convenient target for the `seaborn.load_dataset` function to download sample datasets from. The **tips** dataset can therefore be loaded easily with the seaborn `load_dataset` function (`seaborn.load_dataset("tips")`), which downloads the csv file from this repository rather than shipping it inside the **seaborn** package.
It is one of several example datasets that are used in the documentation of the `seaborn` package to demonstrate the features and uses of the `seaborn` package.

The tips dataset is available in csv format at the following URL: <https://github.com/mwaskom/seaborn-data/blob/master/tips.csv>.

**[Seaborn](https://seaborn.pydata.org/introduction.html#an-introduction-to-seaborn)** is a library for making statistical graphics in Python. It is built on top of matplotlib and closely integrated with pandas data structures.
Many examples use the “tips” dataset, which is very boring but quite useful for demonstration. The tips dataset illustrates the “tidy” approach to organizing a dataset.
 
The [tips csv file](http://vincentarelbundock.github.io/Rdatasets/csv/reshape2/tips.csv) is also available at the [Rdatasets website](https://vincentarelbundock.github.io/Rdatasets/), a large collection of datasets originally distributed alongside the statistical software environment R and some of its add-on packages. The collection is maintained by [Vincent Arel-Bundock](http://arelbundock.com) for teaching and statistical software development purposes.

According to the [tips dataset documentation](http://vincentarelbundock.github.io/Rdatasets/doc/reshape2/tips.html), the **Tips** dataset is a data frame with 244 rows and 7 variables which represents some tipping data where one waiter recorded information about each tip he received over a period of a few months working in one restaurant. 
In all the waiter recorded 244 tips. The data was reported in a collection of case studies for business statistics (Bryant & Smith 1995).[1]

The waiter collected several variables:

### Variables

- tip in dollars  
- bill in dollars    
- sex of the bill payer  
- whether there were smokers in the party  
- day of the week  
- time of day  
- size of the party  



***
 

## Loading / Reading in the data file into Python
<a id="loading"></a>


#### About the Tips dataset.
The tips dataset is available in csv format, as described above, at two URLs:
- Vincent Arel-Bundock's [Rdatasets website](https://vincentarelbundock.github.io/Rdatasets/) at <http://vincentarelbundock.github.io/Rdatasets/csv/reshape2/tips.csv> 
- The [seaborn-data repository](https://github.com/mwaskom/seaborn-data) at <https://github.com/mwaskom/seaborn-data/blob/master/tips.csv>. Here the csv data is actually displayed nicely to the screen in tabular format - to get a link for the raw csv file click the `raw` icon which dumps the raw csv file to the browser from where you can copy the url <https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv>.   
  
A CSV (comma-separated values) file is a file where the values are separated by a comma `,`.

The Python `pandas` library has several functions for reading tabular data such as a csv file into a DataFrame object. See [pandas.pydata.org /read_csv](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-read-csv-table). 
Data that is in csv format can be read into a pandas **DataFrame** object either from a csv file or from a URL. A Pandas DataFrame is a 2 dimensional data structure with rows and columns that resembles a spreadsheet.

The `pandas.read_csv()` function performs type inference to work out the data type of each column. A DataFrame can hold mixed data types such as floats, integers, strings and booleans, but each column has only one data type.
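
As a small illustration of this type inference, the sketch below reads three rows laid out like the tips csv file from an in-memory string instead of a URL. The variable names `sample_csv` and `sample` are only for this example, and the exact name reported for text columns may differ between pandas versions.

```python
import io

import pandas as pd

# Three rows in the same layout as tips.csv (values taken from the first rows of the dataset).
sample_csv = io.StringIO(
    "total_bill,tip,sex,smoker,day,time,size\n"
    "16.99,1.01,Female,No,Sun,Dinner,2\n"
    "10.34,1.66,Male,No,Sun,Dinner,3\n"
    "21.01,3.5,Male,No,Sun,Dinner,3\n"
)

sample = pd.read_csv(sample_csv)

# Each column ends up with exactly one inferred data type:
# decimals become floats, whole numbers become integers, text stays non-numeric.
print(sample.dtypes)
```

Note that `sample['size']` is used rather than `sample.size` when selecting that column, because `size` is also the name of a DataFrame attribute.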

The Tips dataset is a small dataset so the entire csv file can be read into python in one go without causing any problems. 

In [8]:
import pandas as pd
csv_url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv'
# the first line of the csv file is used for the column names
df = pd.read_csv(csv_url)  # create a DataFrame named df by reading in the csv file from a URL

In [9]:
csv_url2 = 'http://vincentarelbundock.github.io/Rdatasets/csv/reshape2/tips.csv'
# the first line of the csv file is used for the column names
df2 = pd.read_csv(csv_url2)  # create a DataFrame named df2 by reading in the csv file from a URL

In [4]:
df.head()

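| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |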


In [10]:
df2.head()

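| | Unnamed: 0 | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 3 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 5 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |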



***
<a id="exploring"></a>
## Exploring the tips dataset

Using pandas 
<img src = "https://pandas.pydata.org/_static/pandas_logo.png" alt="pandas-logo" style="width:200px"/>

and seaborn python packages.
<img src = "https://seaborn.pydata.org/_static/scatterplot_matrix_thumb.png?v=0.9.0" alt="seaborn" style="height:100px"/>

Having read in the dataset using the pandas `read_csv()` function, I will look at the dataset using the pandas and seaborn packages. These two packages go hand in hand for analysing datasets. The pandas package has many useful functions for looking at the DataFrame object that is created from reading in the csv file. The dataset can be sliced and diced to look at subsets of the dataset, to look at certain variables in the different columns, or to select different categories of the variables or other combinations of rows and columns. Summary statistics can be easily generated with pandas. The seaborn package can then be used for creating some nice plots. You can also generate some plots just using pandas.

The following pandas DataFrame methods and attributes can be used to get a good overview of the dataset.
- `df.head()`
- `df.tail()`
- `df.dtypes`
- `df.index`
- `df.isna().sum()`

I will use pandas first to have a look at the tips dataset before going on to use seaborn and other packages.

<img src = "https://pandas.pydata.org/_static/pandas_logo.png" width="250" alt="pandas-logo"/>

In [None]:
df.head() # see the top 5 rows of the data
df.tail() # the last 5 rows of the data
df.index # the index of the DataFrame
df.isna().sum() # how many missing values or NA's
df.dtypes # the data types of each variable
df.columns # the columns in the dataframe
df.dtypes.value_counts()  # how many types of variables in the dataset

The csv file has been read in using `pandas.read_csv()`. 
The column names were assigned using the first line of data in the csv file. This is the default treatment with `pandas.read_csv()` if you have not set a header row or provided column names.

A csv file may or may not have a header row that can be used for the column names. If there is a header row, as there is here, pandas can assign the column names from it. To replace the names in an existing header row, pass `header=0` together with a list of new names in the `names` parameter, for example `names=['col-name1', 'col-name2', 'col-name3']`. If the file has no header row, set `header=None` (or just pass `names`) so that the first line is read as data; pandas then either numbers the columns or uses the names provided. If you do neither, pandas treats the first line of the file as the names of the columns.
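
The interaction between the `header` and `names` parameters can be seen on a tiny in-memory csv. This is only a sketch: the replacement column names (`bill`, `gratuity`, `party_size`) are hypothetical and are not used anywhere else in this notebook.

```python
import io

import pandas as pd

csv_text = (
    "total_bill,tip,size\n"
    "16.99,1.01,2\n"
    "10.34,1.66,3\n"
)

# Default behaviour: the first line becomes the column names.
default_names = pd.read_csv(io.StringIO(csv_text))

# header=0 marks line 0 as a header row to be replaced by the names given
# (hypothetical names, for illustration only).
renamed = pd.read_csv(
    io.StringIO(csv_text), header=0, names=["bill", "gratuity", "party_size"]
)

# header=None says there is no header row, so the first line is kept as data
# and the columns are simply numbered.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(list(default_names.columns), len(default_names))
print(list(renamed.columns), len(renamed))
print(list(no_header.columns), len(no_header))
```

With `header=None` on a file that does have a header row, the header line ends up as the first row of data, which is usually not what is wanted.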


The top of the dataframe shows that there are 7 columns as expected and an index that begins at 0 for the first row. If the index of a DataFrame is not set to a particular column or some other value, it will default to a sequence of integers beginning at 0. I think this is fine for this dataset.  
The index has been set to a range of integers from 0 (for the first row) up to 243 for the last row or observation in the dataset.
You can set your own `index_col`  to the column numbers or names to use as the row index but I don't see any suitable column for an index. 
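
For comparison, here is a minimal sketch of what `index_col` does when a file does carry a column of row numbers. The in-memory csv below is an assumption made for illustration: it imitates a file whose first column has no name and simply numbers the rows from 1.

```python
import io

import pandas as pd

# Assumed layout: an unnamed leading column of row numbers, then the data columns.
csv_text = (
    ",total_bill,tip\n"
    "1,16.99,1.01\n"
    "2,10.34,1.66\n"
    "3,21.01,3.5\n"
)

# Without index_col the row numbers become an ordinary column
# and pandas adds its own index starting at 0.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column is used as the row index instead.
row_number_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_index.columns))
print(list(row_number_index.columns))
print(list(row_number_index.index))
```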

The bottom of the dataset shows no surprises either. The tail function is useful for making sure a dataset has been read in properly as any problems in reading in csv files usually throw out the end of the dataframe. Here I can see the same types of values in each column, the last index is 243 so this means there must be 244 rows of observations in the dataset. 

There are 244 observations in the dataset. The index is a range of integers from 0 up to but not including 244. 

Check to see if there are any missing values or NA's in the dataset. `isna()` returns a boolean value of True or False, these are then summed to give a count of the missing values across the different columns.
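
To show how the count of missing values is built up, the sketch below uses a small made-up DataFrame (named `toy` here) with gaps put in deliberately. It is not the tips data.

```python
import numpy as np
import pandas as pd

# Made-up rows with deliberately missing values.
toy = pd.DataFrame({
    "total_bill": [16.99, np.nan, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "day": ["Sun", None, "Sun"],
})

# isna() gives True/False for every cell; summing counts the True values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Summing again gives the total number of missing cells in the whole DataFrame.
total_missing = int(missing_per_column.sum())
print(total_missing)
```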

Data types (`dtypes`) are inferred by the `read_csv()` function. It is also possible to pass the data type when reading in the file.

The data types show that there are three numerical columns and four non-numerical object columns. 

`total_bill` and `tip` are floats while `size` is an integer. The remaining columns have been read in as objects. As far as I know, object is used for strings or when there are mixed data types in a column. The `time` column here is not an actual time but instead just a categorical variable.
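
Passing the data type when reading the file could look like the sketch below, which asks pandas to store two text columns as categories. The rows are a made-up sample in the style of the tips data, and whether the category type is worth using here is a judgement call rather than a requirement.

```python
import io

import pandas as pd

# Made-up sample rows in the style of the tips data.
csv_text = (
    "tip,sex,time\n"
    "1.01,Female,Dinner\n"
    "1.66,Male,Dinner\n"
    "2.00,Male,Lunch\n"
)

# The dtype parameter takes a mapping from column name to the type wanted.
typed = pd.read_csv(
    io.StringIO(csv_text), dtype={"sex": "category", "time": "category"}
)

print(typed.dtypes)
print(list(typed["time"].cat.categories))
```

A category column records the fixed set of labels it can take, which suits variables such as `time` that only have a couple of possible values.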
***

## Summary Statistics

Pandas `describe` can be used to give a good summary of the numerical variables of the dataframe including, the count, the mean, the standard deviation, the minimum and maximum values, the median etc.

`50%` is the median value, where half the values are above it and half are below. With an even number of observations, as here, the median is the average of the two middle values when the data is sorted in order of magnitude.
`25%` is similar in that it shows the value below which 25% of the values fall. It is roughly like taking the median of the bottom half of the dataset when ordered by magnitude. 
`75%` is a similar idea, roughly the median of the top half of the dataset. 
The mean is the average value in the dataset, but it may not be typical of the values in the dataset as it can be pulled up or down by a few very large or very small values. 
The median is less affected by such extreme values and so is often closer to a typical value in the dataset. 
Look to see if the mean and median are similar or are much different from each other.
If the median and mean are similar then the distribution of values is probably fairly symmetric. 
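
A short worked example may help here. The numbers below are made up to show how the median is found for an even number of values and how one large value moves the mean but not the median. The quartiles come from the pandas `quantile` method with its default settings, which interpolates between neighbouring values, so they will not always match the "median of each half" rule of thumb exactly.

```python
import pandas as pd

# Even number of values: the median is the average of the two middle values (2.0 and 3.0).
even = pd.Series([1.0, 2.0, 3.0, 4.0])
even_median = even.median()

# One very large value pulls the mean up but leaves the median where it was.
skewed = pd.Series([1.0, 2.0, 3.0, 4.0, 50.0])
skewed_mean = skewed.mean()
skewed_median = skewed.median()

# The 25%, 50% and 75% rows of describe() correspond to these quantiles.
quartiles = skewed.quantile([0.25, 0.5, 0.75])

print(even_median)
print(skewed_mean, skewed_median)
print(quartiles)
```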

`df.mean()`

`df.loc[df.loc[:,'sex']=='Female']`


In [None]:
df.describe()

Furthermore, you can take summary statistics of subsections of the dataset.

For this dataset I could look at the statistics by sex to see if males and females leave similar tips, by day of the week to see if tips vary much from one day to another or from a weekday to the weekend, by time of day, or by smoker status to see whether a smoker or a non-smoker is more inclined to leave a larger or a smaller tip.
Part 2 question looks at the relationship between tip amount and total bill amount so I will leave that for now.



### Accessing data from the dataframe for calculating summary statistics

Some useful operations are sorting, filtering and boolean selection.

Sort the dataframe to see the type of values at the top and bottom of the dataset. It can be sorted in ascending order to see the lowest values of a variable or combination of variables, such as by tip amount.

In [None]:
df.sort_values(by='tip').head(10)

In [None]:
df.sort_values(by='tip', ascending = False).head()
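
The same idea on four rows, as a sketch: `sort_values` followed by `head` picks out the smallest tips, and `nlargest` is a pandas shortcut for the largest ones. The DataFrame name `toy` is only for this example; its values are copied from the first rows of the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, 3.50, 3.31],
})

# Ascending sort, then take the top rows: the two smallest tips.
lowest = toy.sort_values(by="tip").head(2)

# nlargest returns the rows with the largest values, biggest first.
highest = toy.nlargest(2, "tip")

print(lowest)
print(highest)
```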

In [None]:
df.loc[:, 'sex'].describe()

Selecting rows using boolean values and then getting summary statistics.
The summary statistics can then be seen for males or females, or for smokers or non-smokers.

In [None]:
df[df.loc[:, 'sex'] == 'Male'].describe()

In [None]:
df[df.loc[:, 'smoker'] == 'Yes'].describe()

In [None]:
df[df.loc[:, 'smoker'] == 'No'].describe()

In [None]:
df[df.time=='Dinner'].describe()

In [None]:
df[df.loc[:, 'sex'] == 'Female'].describe()
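
An alternative to filtering one group at a time is to let `groupby` produce the summary for every group in one go. The sketch below does this on five rows copied from the top of the dataset (held in a DataFrame called `toy` for illustration) and checks that it agrees with the boolean selection approach.

```python
import pandas as pd

toy = pd.DataFrame({
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
})

# One row of summary statistics per group.
by_sex = toy.groupby("sex")["tip"].describe()

# The same figures for one group using boolean selection.
female_only = toy[toy["sex"] == "Female"]["tip"].describe()

print(by_sex)
print(female_only)
```

Both approaches give the same numbers; `groupby` just saves repeating the selection for each category.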

Using **groupby** to get counts by group.
There are four categorical variables: sex, smoker, day and time.

In [None]:
df.groupby("day").count()

In [None]:
df.groupby(["sex","smoker"]).count()

In [None]:
df.groupby(["day","sex","smoker"]).count()
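
One detail worth knowing about `count()` is that it counts the non-missing values in each column separately, whereas `size()` counts the rows in each group. The tips data has no missing values so the two agree there, but the made-up rows below (with one tip deliberately missing) show the difference.

```python
import numpy as np
import pandas as pd

# Made-up rows; one tip is deliberately missing.
toy = pd.DataFrame({
    "tip": [1.01, np.nan, 3.50, 3.31],
    "sex": ["Female", "Male", "Male", "Male"],
    "smoker": ["No", "No", "Yes", "No"],
})

# count(): non-missing values per column within each group.
per_column_counts = toy.groupby("sex").count()

# size(): number of rows in each group, whatever the columns contain.
rows_per_group = toy.groupby("sex").size()

print(per_column_counts)
print(rows_per_group)
```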

Here experimenting with adding new columns to the dataframe based on other columns using the `np.where` function.

https://stackoverflow.com/questions/36603018/pandas-multiple-conditions-based-on-multiple-columns-using-np-where/36603238



In [None]:
df['mean'] = np.where(df['tip']<=2 , 'yes', 'no')

In [None]:

df['weekend'] = np.where(((df.day == 'Sun') | (df.day == 'Sat') | (df.day == 'Fri')), 'weekend', 'weekday')

df['mean'] = np.where(((df.tip <3) & ((df.smoker == 'Yes'))), 'a mean smoker', 'not a mean smoker')  # smoker values are 'Yes' / 'No'
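
A minimal sketch of `np.where` on four made-up rows is given below. The new column names (`low_tip_smoker`, `part_of_week`) are hypothetical, and Friday is treated as part of the weekend only to mirror the grouping used in this notebook. The last line shows that string comparisons are case-sensitive, so the label has to be spelt exactly as it appears in the data.

```python
import numpy as np
import pandas as pd

# Made-up rows in the style of the tips data.
toy = pd.DataFrame({
    "tip": [1.01, 3.50, 2.00, 5.00],
    "smoker": ["No", "Yes", "Yes", "No"],
    "day": ["Sun", "Sat", "Thur", "Fri"],
})

# Two conditions combined with &; each condition needs its own brackets.
toy["low_tip_smoker"] = np.where(
    (toy["tip"] < 3) & (toy["smoker"] == "Yes"), "yes", "no"
)

# isin() is a shorter way of writing several == comparisons joined with |.
toy["part_of_week"] = np.where(
    toy["day"].isin(["Fri", "Sat", "Sun"]), "weekend", "weekday"
)

# Comparing against 'yes' instead of 'Yes' matches nothing.
lowercase_matches = int((toy["smoker"] == "yes").sum())

print(toy)
print(lowercase_matches)
```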

In [None]:
df.head()

In [None]:
df.groupby(["mean"]).count()

In [None]:
df.head()


<a id="visualising"></a>
## Visualising the tips dataset

#### Seaborn plots of the tips dataset
Having used pandas functions above to select subsets of the dataset, the **seaborn** package can be used to create some nicer visualisations than the basic plots in pandas and matplotlib. The **seaborn** library works with **pandas**. It is built on top of **matplotlib** and closely integrated with **pandas** data structures.

Whatever way the plots are created, they can be used to verify the numbers from the summary statistics. 
The plots will visualise any obvious relationships between the variables and also whether there are any groups of observations that are clearly separate from other groups of observations. 
 
The `pairplot` function in seaborn shows scatter plots of the numerical variables against each other. A kernel density estimate or histogram is displayed down the diagonal. 




The following example plot of the tips dataset is taken from the official Introduction to seaborn guide. It uses the tips dataset loaded with `sns.load_dataset` to draw a faceted scatter plot with multiple semantic variables.

tips = sns.load_dataset("tips")
sns.relplot(x="total_bill", y="tip", col="time",
            hue="smoker", style="smoker", size="size",
            data=tips)

This plot shows the relationship between five variables in the tips dataset. Two numeric variables (total_bill and tip) determined the position of each point on the axes, and the third (size) determined the size of each point. One categorical variable split the dataset onto two different axes (facets), and the other determined the color and shape of each point.

This plot required only a single call to the seaborn function `relplot()`. Seaborn aims to make visualization a central part of exploring and understanding data. Its dataset-oriented plotting functions operate on dataframes and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots.

In [None]:
import seaborn as sns
sns.pairplot(df, hue = "time")

In [None]:
sns.pairplot(df, hue = "sex")

In [None]:
%matplotlib inline

In [None]:
df.plot(subplots=True, figsize=(12,6))

### what do the plots show at a glance?

##### More male than female bill-payers. 

In [None]:
sns.countplot(x="sex", data=df)  # pass the column by keyword so this also works in newer seaborn versions



##### More non-smokers than smokers

In [None]:
sns.countplot(x="smoker", data=df, hue="sex")

##### Significantly more male bill-payers on Saturdays and Sundays.
On Thursdays and Fridays there are nearly the same number of male and female bill-payers.
There does not seem to be any data for Mondays, Tuesdays and Wednesdays.

In [None]:
sns.countplot(x="day", data=df, hue="sex")

##### Far more non-smokers on Sundays and Thursdays but more smokers on Fridays
Maybe after work!

In [None]:
sns.countplot(x="day", data=df, hue="smoker")
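
The heights of the bars in a count plot are just counts of rows, so the same numbers can be tabulated with `pd.crosstab`. The sketch below uses a handful of made-up rows rather than the tips data, purely to show the shape of the result.

```python
import pandas as pd

# Made-up rows: one row per bill, recording the day and whether there was a smoker.
toy = pd.DataFrame({
    "day": ["Thur", "Fri", "Sat", "Sat", "Sun", "Sun", "Sun"],
    "smoker": ["No", "Yes", "Yes", "No", "No", "No", "Yes"],
})

# Rows are days, columns are smoker categories, cells are the number of bills.
counts = pd.crosstab(toy["day"], toy["smoker"])

print(counts)
```

Each cell of the table corresponds to one bar in the count plot of `day` coloured by `smoker`.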

In [None]:
sns.relplot(x="total_bill", y="tip", col="time",
            hue="smoker", style="smoker", size="size",
            data=df)

In [None]:
sns.catplot(x="day", y="total_bill", hue="smoker",
            kind="swarm", data=df)

In [None]:
sns.boxplot(x="day", y="tip", data=df)
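
The numbers behind a single box in a box plot can be worked out by hand. The sketch below assumes the usual rule that points lying more than 1.5 times the interquartile range beyond the quartiles are drawn as outliers (the default `whis=1.5` in seaborn), and it uses made-up tip values with pandas' default quantile calculation, so the plotted values for the real data will differ.

```python
import pandas as pd

# Made-up tip amounts with one unusually large value.
tips_sample = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 10.0])

# The box runs from the lower quartile to the upper quartile, with a line at the median.
q1, median, q3 = tips_sample.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range: the height of the box

# Points outside these limits are shown individually as outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = tips_sample[(tips_sample < lower_limit) | (tips_sample > upper_limit)]

print(q1, median, q3, iqr)
print(lower_limit, upper_limit)
print(list(outliers))
```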

#### histograms and kernel density function of tip and total_bill

In [None]:
sns.distplot(df['tip'])

In [None]:
sns.distplot(df['total_bill'])

In [None]:
sns.distplot(df['size'])

***
# Part 2 Regression: Discuss and analyse whether there is a relationship between the total bill and tip amount.

***
# Part 3 Analysis: Analyse the relationship between the variables within the dataset.
Analyse all pairs of variables or just a subset of variables.

***

In [None]:
!ls

***
<a id="references"></a>
# References

### The Tips dataset
- [1] Bryant, P. G. and Smith, M (1995) Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL: Richard D. Irwin Publishing.

### Project and package websites
- **[Python.org](https://www.python.org/)**  
- **[GitHub guides - Mastering Markdown](https://guides.github.com/features/mastering-markdown/)**  
- **[Project Jupyter](https://jupyter.org/)**  
- **[seaborn.pydata.org](https://seaborn.pydata.org/)**  
- **[tips dataset on Michael Waskom's GitHub](https://github.com/mwaskom/seaborn-data/blob/master/tips.csv)**  
- **[The R Datasets](http://vincentarelbundock.github.io/Rdatasets/datasets.html)** - including the tips dataset
- **[seaborn](https://seaborn.pydata.org/introduction.html#introduction)**  
- **[ipython magic commands ](https://ipython.readthedocs.io/en/stable/interactive/magics.html)**  

### Python, GitHub and Jupyter resources
- [python.org](https://docs.python.org/3/library/index.html)
- Python for Data Analysis - Chapter 4 NumPy Basics: Arrays and Vectorised Computation by Wes McKinney
- [Python Data Science Handbook by Jake VanderPlas ](https://jakevdp.github.io/PythonDataScienceHandbook/) 
- [Jake VanderPlas Website](http://vanderplas.com)
- [numpy quickstart tutorial](https://numpy.org/devdocs/user/quickstart.html)


- [GitHub Flavoured Markdown](https://github.github.com/gfm/)
- [Jupyter Notebook documentation](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html#basic-workflow)
- [Jupyter Notebook Tips, Tricks, and Shortcuts](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/)
- [python random docs](https://docs.python.org/3/library/random.html#module-random)
- [LaTeX equations in Jupyter](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html#LaTeX-equations)
- [pythonprogramming.net](https://pythonprogramming.net)

- Jupyter logo.  Cameron Oelsen [BSD (http://opensource.org/licenses/bsd-license.php)]
[toc](#toc)
***

### Learning some markdown and magic commands to run in the Jupyter notebook 


### learning latex... 

see cheat sheet at [www.nyu.edu](https://www.nyu.edu/projects/beber/files/Chang_LaTeX_sheet.pdf)
or [wch.github.io](https://wch.github.io/latexsheet/latexsheet-0.png)


$e^{i\pi} + 1 = 0$
$$e^x=\sum_{i=0}^\infty \frac{1}{i!}x^i$$

$\sigma$
$$\sigma$$
$\mu$

$x \sim N(\mu,\sigma^2)$

### images
to resize the image, I need to use html instead of markdown and use a css style