# Distribution Plots

Let's look at some plots that help us visualize the distribution of a data set, using Plotly. These plots are:

* 1) Univariate Plots
    * 1. Histogram
    * 2. KDE plot
* 2) Bivariate Plots
    * 1. Scatter plot
    * 2. Scatter matrix
    * 3. Line plot

## Imports

In [2]:
# Import Plotly Express for data visualization
import plotly.express as px


## Explore Plotly's Built-in Datasets
Plotly ships with several built-in datasets that we can use for practice.
Let's list the available datasets; the [plotly.data reference](https://plotly.com/python-api-reference/generated/plotly.data.html) describes each one.

In [2]:
# Display the list of available built-in datasets in Plotly
px.data.__all__

['carshare',
 'election',
 'election_geojson',
 'experiment',
 'gapminder',
 'iris',
 'medals_wide',
 'medals_long',
 'stocks',
 'tips',
 'wind']

## Load Tips Dataset
Load the 'tips' dataset, which contains information about tips received by waitstaff at a restaurant.
This dataset will be used for visualizations.

In [3]:
# Load the tips dataset into a DataFrame
tips=px.data.tips()

In [5]:
# Display the first few rows of the dataset to understand its structure
tips.head()

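|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |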


## Get Basic Information about the Data
Check the structure of the dataset to see the columns, their data types, and non-null counts.


In [6]:
# Get information on data types and null values for each column
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB


## Get Descriptive Statistics
Use the .describe() method to generate summary statistics for the numerical columns.

In [7]:
# Summary statistics for numerical data
tips.describe()

|   | total_bill | tip | size |
|---|---|---|---|
| count | 244.0 | 244.0 | 244.0 |
| mean | 19.785943 | 2.998279 | 2.569672 |
| std | 8.902412 | 1.383638 | 0.9511 |
| min | 3.07 | 1.0 | 1.0 |
| 25% | 13.3475 | 2.0 | 2.0 |
| 50% | 17.795 | 2.9 | 2.0 |
| 75% | 24.1275 | 3.5625 | 3.0 |
| max | 50.81 | 10.0 | 6.0 |


In [8]:
# Summary statistics for the categorical (object) columns
tips.describe(include="O")

|   | sex | smoker | day | time |
|---|---|---|---|---|
| count | 244 | 244 | 244 | 244 |
| unique | 2 | 2 | 4 | 2 |
| top | Male | No | Sat | Dinner |
| freq | 157 | 151 | 87 | 176 |


## Analyze Data for Specific Groups
Let's calculate the average of the numerical columns for the whole dataset and then for male smokers only.
Comparing the two shows whether male smokers spend or tip differently from the average customer.


In [9]:
# Average of the numerical columns across all rows (baseline for comparison)
mean_values = tips[['total_bill', 'tip', 'size']].mean()
print(mean_values)

total_bill    19.785943
tip            2.998279
size           2.569672
dtype: float64


In [16]:
# Average of the numerical columns for male smokers only
tips[(tips["sex"]=="Male") & (tips["smoker"]=="Yes")].mean(numeric_only=True)

total_bill    22.284500
tip            3.051167
size           2.500000
dtype: float64
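
One way to read the two sets of averages above is to express the mean tip as a share of the mean bill. The sketch below does this in plain Python, reusing the averages printed above. The names `overall`, `male_smokers`, and `tip_share` are illustrative only, and a ratio of means is not the same thing as the mean of per-row tip percentages.

```python
# Averages copied from the two outputs above
overall = {"total_bill": 19.785943, "tip": 2.998279}
male_smokers = {"total_bill": 22.284500, "tip": 3.051167}

def tip_share(means):
    """Mean tip divided by mean total bill (a ratio of means)."""
    return means["tip"] / means["total_bill"]

overall_share = tip_share(overall)
male_smoker_share = tip_share(male_smokers)

print(f"overall:      {overall_share:.1%}")
print(f"male smokers: {male_smoker_share:.1%}")
```

On these figures, male smokers have a higher average bill but a somewhat lower tip share than the dataset as a whole.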

## Check Value Distribution for Specific Columns
Get the value counts for the 'time' column, which shows whether the meal is lunch or dinner.

In [17]:
# Count occurrences of each unique value in the 'time' column
tips["time"].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64

## Check Unique Values for 'size' Column
Find out the unique values for the 'size' column, which represents the number of people in the group.


In [18]:
# Get the unique values in the 'size' column
tips["size"].unique()

array([2, 3, 4, 1, 6, 5], dtype=int64)

## 1) Univariate Plots
Univariate plots are used to visualize a single variable. 
We will focus on:
* Histogram
* Kernel Density Estimate (KDE) plot

### 1. Histogram
A histogram is used to visualize the distribution of a single variable. 
Let's visualize the distribution of total bill amounts.


In [19]:
# Plot a basic histogram of the 'total_bill' column
px.histogram(data_frame=tips, x="total_bill")

In [20]:
# Print the docstring to explore additional options for the px.histogram function
help(px.histogram)

Help on function histogram in module plotly.express._chart_types:

histogram(data_frame=None, x=None, y=None, color=None, pattern_shape=None, facet_row=None, facet_col=None, facet_col_wrap=0, facet_row_spacing=None, facet_col_spacing=None, hover_name=None, hover_data=None, animation_frame=None, animation_group=None, category_orders=None, labels=None, color_discrete_sequence=None, color_discrete_map=None, pattern_shape_sequence=None, pattern_shape_map=None, marginal=None, opacity=None, orientation=None, barmode='relative', barnorm=None, histnorm=None, log_x=False, log_y=False, range_x=None, range_y=None, histfunc=None, cumulative=None, nbins=None, text_auto=False, title=None, template=None, width=None, height=None) -> plotly.graph_objs._figure.Figure
        In a histogram, rows of `data_frame` are grouped together into a
        rectangular mark to visualize the 1D distribution of an aggregate
        function `histfunc` (e.g. the count or sum) of the value `y` (or `x` if
        `orie

### Enhanced Histogram with Hover Data
Let's add more information to the histogram by displaying extra data on hover.
Here, we will visualize the 'total_bill' and display all other columns as hover data.

In [23]:
# Histogram with hover data
px.histogram(data_frame=tips ,x="total_bill",hover_data=tips.columns)

### Colored Histogram by 'smoker' Column
Let's color the histogram by whether the customer was a smoker. This lets us compare the bill distributions of smokers and non-smokers.

In [24]:
# Histogram colored by smoker status
px.histogram(data_frame=tips ,x="total_bill",
             hover_data=tips.columns
             ,color="smoker")

### Customizing Colors in Histogram
You can change the color palette to make the plot visually distinct.
Here, we'll assign custom colors to represent smokers and non-smokers.

In [25]:
# Histogram colored by smoker status with a custom palette (green and red)
px.histogram(data_frame=tips ,x="total_bill",
             hover_data=tips.columns
             ,color="smoker",
             color_discrete_sequence=["green","red"])

In [26]:
# Set the number of bins to 20 to control how finely the data is divided
px.histogram(data_frame=tips ,x="total_bill",
             hover_data=tips.columns
             ,color="smoker",
             color_discrete_sequence=["green","red"],
             nbins=20)
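
To make the effect of the bin count concrete, here is a minimal plain-Python sketch of equal-width binning. It uses the first five `total_bill` values from the table shown earlier. The helper `bin_counts` and the bin edges are assumptions made for illustration; Plotly chooses its own bin edges, so its bars will not necessarily match these counts.

```python
# Equal-width binning by hand (illustrative; not Plotly's exact algorithm)
values = [16.99, 10.34, 21.01, 23.68, 24.59]  # first five total_bill values

def bin_counts(data, start, width, n_bins):
    """Count how many values fall into each half-open bin [lo, lo + width)."""
    counts = [0] * n_bins
    for v in data:
        idx = int((v - start) // width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts

# Two wide bins: [10, 20) and [20, 30)
coarse = bin_counts(values, start=10, width=10, n_bins=2)
# Four narrower bins: [10, 15), [15, 20), [20, 25), [25, 30)
fine = bin_counts(values, start=10, width=5, n_bins=4)

print(coarse)
print(fine)
```

More bins reveal finer structure but leave fewer observations per bar, which is the trade-off to keep in mind when adjusting `nbins`.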

In [27]:
# Create a histogram to visualize the distribution of the 'tip' column.
# Include hover data to show more information when hovering over the bars.
px.histogram(data_frame=tips ,
             x="tip",
             hover_data=tips.columns)

### Histogram for Categorical Data
In addition to numerical data, you can also create histograms for categorical data.
This will show the frequency of each category.

In [28]:
# Creating a histogram of the 'day' column, which is categorical.
px.histogram(data_frame=tips,x="day")

### Customizing Histogram Color
You can change the color of the histogram bars using the `color_discrete_sequence` parameter.


In [32]:
# Change the color of the histogram bars to green
px.histogram(data_frame=tips,x="day"
            ,color_discrete_sequence=["green"])

### Facet Histogram
Use faceting to create multiple subplots based on categorical columns.
This helps to visualize how data is distributed across different categories.

In [36]:
# Create a faceted histogram of 'total_bill' with one panel per 'day'.
# Because y="tip" is given, each bar shows the sum of tips in that total_bill bin rather than a row count.
# Color the bars by 'sex' and add a marginal box plot for more detailed analysis.
fig=px.histogram(data_frame=tips,
                 x="total_bill",
                 y="tip",
                 color="sex",
                 facet_col="day",
                 marginal="box")

fig.show()
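
When `y` is passed to a histogram, the bar height is, to my understanding of the Plotly Express default, the sum of `y` within each `x` bin instead of the number of rows. The plain-Python sketch below contrasts the two aggregations on the first five rows of the dataset. The bin width of 10 is an assumption for illustration; Plotly picks its own bins.

```python
from collections import defaultdict

# (total_bill, tip) pairs from the first five rows of the dataset
rows = [(16.99, 1.01), (10.34, 1.66), (21.01, 3.50), (23.68, 3.31), (24.59, 3.61)]
bin_width = 10  # assumed width, chosen only for this illustration

count_per_bin = defaultdict(int)      # what a histogram shows without y
tip_sum_per_bin = defaultdict(float)  # what it shows when y="tip" is summed

for total_bill, tip in rows:
    bin_start = int(total_bill // bin_width) * bin_width
    count_per_bin[bin_start] += 1
    tip_sum_per_bin[bin_start] += tip

for start in sorted(count_per_bin):
    print(f"[{start}, {start + bin_width}): "
          f"count={count_per_bin[start]}, sum of tip={tip_sum_per_bin[start]:.2f}")
```

If a different aggregation is wanted, `px.histogram` accepts a `histfunc` argument, which appears in the signature printed by `help(px.histogram)` above.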

### 2. KDE Plot (Kernel Density Estimation)

A KDE plot estimates the probability density function of a continuous variable (see [Kernel Density Estimation](http://en.wikipedia.org/wiki/Kernel_density_estimation#Practical_estimation_of_the_bandwidth)).
It produces a smooth curve that represents the distribution of the data points.
A KDE replaces every observation with a Gaussian (normal) curve centered on that value, then adds these curves together and normalizes the result so that the total area under the curve is 1.
Let's create KDE plots for the 'total_bill' column using Plotly's figure factory.
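
To illustrate the idea of one Gaussian per observation, here is a minimal sketch of a Gaussian KDE written in plain Python. The function name `gaussian_kde` and the bandwidth of 3.0 are assumptions made for this example; Plotly's figure factory selects its bandwidth automatically, so its curve will look different.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a density function built as the average of Gaussian bumps,
    one bump centered on each observation."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

sample = [16.99, 10.34, 21.01, 23.68, 24.59]   # first five total_bill values
density = gaussian_kde(sample, bandwidth=3.0)  # bandwidth chosen arbitrarily

# Sanity check: a density should integrate to roughly 1 (simple Riemann sum)
step = 0.1
grid = [i * step for i in range(-200, 601)]    # covers -20 to 60
area = sum(density(x) for x in grid) * step
print(round(area, 3))
```

A larger bandwidth gives a smoother, flatter curve, while a smaller one follows individual observations more closely.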

In [44]:
import plotly.figure_factory as ff

# Convert 'total_bill' to a list for input to the KDE plot function
hist_data=tips.total_bill.to_list()
# print(hist_data)

# Create a distribution plot (histogram, KDE curve, and rug) with a bin size of 2; the rug shows the individual data points
ff.create_distplot([hist_data],
                   ["total_bill"],
                   bin_size=2,
                   show_rug=True,
                   show_hist=True,
                   show_curve=True
                   )

### Customizing the KDE Plot
We can customize the KDE plot by hiding the rug plot, histogram, or curve, and changing colors.


In [45]:
# KDE plot without rug plot
ff.create_distplot([hist_data],
                   ["total_bill"],
                   bin_size=2,
                   show_rug=False,
                   show_hist=True,
                   show_curve=True
                   )


In [46]:
# KDE plot without the curve (just the histogram)
ff.create_distplot([hist_data],
                   ["total_bill"],
                   bin_size=2,
                   show_rug=False,
                   show_hist=True,
                   show_curve=False
                   )

In [47]:
# KDE plot without the histogram (just the curve)
ff.create_distplot([hist_data],
                   ["total_bill"],
                   bin_size=2,
                   show_rug=False,
                   show_hist=False,
                   show_curve=True
                   )

In [52]:
# Customizing the color of the KDE plot (changing the curve color to green)
ff.create_distplot([hist_data],
                   ["total_bill"],
                   bin_size=2,
                   show_rug=False,
                   show_hist=False,
                   show_curve=True,
                   colors=["green"]
                   )

In [56]:
# Customizing the axis labels and title of the KDE plot
fig=ff.create_distplot([hist_data],
                   ["total_bill"],
                   bin_size=2,
                   show_rug=False,
                   show_hist=False,
                   show_curve=True,
                   colors=["green"]
                   )
fig.update_xaxes(title_text="total bill")
fig.update_yaxes(title_text="density")
fig.update_layout(title_text="total bill distribution")
fig.show()





## 2) Bivariate Plots
Bivariate plots visualize the relationship between two variables. 
* 1. scatter
* 2. scatter_matrix
* 3. line

Let's explore scatter plots and more.

### 1. Scatter Plot
Scatter plots show the relationship between two continuous variables by plotting data points.
Let's plot 'total_bill' vs 'tip' to see how they relate.

In [None]:
# Simple scatter plot of 'total_bill' vs 'tip'
px.scatter(data_frame=tips,
           x="total_bill"
           ,y="tip",
           title="total bill vs tip ")
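
A scatter plot shows the relationship visually; a correlation coefficient puts a number on it. The sketch below computes Pearson's r by hand for the first five rows of the dataset. The helper `pearson_r` is illustrative, and five rows are far too few to draw conclusions about the full 244-row dataset.

```python
import math

# First five rows of the dataset: total_bill and tip
total_bill = [16.99, 10.34, 21.01, 23.68, 24.59]
tip = [1.01, 1.66, 3.50, 3.31, 3.61]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(total_bill, tip)
print(f"Pearson r on the first five rows: {r:.2f}")
```

A value near +1 means tips tend to rise with the bill, a value near 0 means no linear relationship, and a value near -1 means they move in opposite directions.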

In [60]:
# Add color to distinguish between smokers and non-smokers
px.scatter(data_frame=tips,
           x="total_bill"
           ,y="tip",
           color="smoker",
           title="total bill vs tip ")

In [62]:
# Add histograms on both axes to visualize the distribution of 'total_bill' and 'tip'
px.scatter(data_frame=tips,
           x="total_bill"
           ,y="tip",
            color="smoker",
            marginal_x="histogram",
            marginal_y="histogram",
           title="total bill vs tip ")

### Customizing Scatter Plot
You can change the color scheme and plot type for marginal distributions to make the plot more visually distinct.


In [None]:
# Customizing scatter plot with violin plot on x-axis and box plot on y-axis, and using a green and red color scheme
px.scatter(data_frame=tips,
           x="total_bill"
           ,y="tip",
            color="smoker",
            marginal_x="violin",
            marginal_y="box",
            color_discrete_sequence=["green","red"],
           title="total bill vs tip ")


In [66]:
# Creating a scatter plot with 'total_bill' on the x-axis and 'tip' on the y-axis.
# Marginal plots are added: violin plot on the x-axis and box plot on the y-axis.
# Custom colors are applied with an explicit mapping: red for smokers ('Yes') and green for non-smokers ('No').
px.scatter(data_frame=tips,
           x="total_bill"
           ,y="tip",
            color="smoker",
            marginal_x="violin",
            marginal_y="box",
            color_discrete_map={"Yes":"red","No":"green"},
           title="total bill vs tip ")
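
The difference between a color sequence and a color map can be sketched in plain Python. To my understanding, Plotly Express hands out the colors of `color_discrete_sequence` in the order in which category values first appear (unless `category_orders` is set), whereas `color_discrete_map` ties each value to a fixed color. The helper `assign_from_sequence` and the two sample columns below are hypothetical and only mimic that behavior.

```python
def assign_from_sequence(values, palette):
    """Give each new category the next palette color, in order of first appearance."""
    assigned = {}
    for v in values:
        if v not in assigned:
            assigned[v] = palette[len(assigned) % len(palette)]
    return assigned

palette = ["green", "red"]

# Hypothetical column where 'No' appears first
no_first = ["No", "No", "Yes", "No"]
# Hypothetical column (for example after sorting or filtering) where 'Yes' appears first
yes_first = ["Yes", "No", "No", "Yes"]

colors_no_first = assign_from_sequence(no_first, palette)
colors_yes_first = assign_from_sequence(yes_first, palette)
print(colors_no_first)
print(colors_yes_first)

# An explicit mapping does not depend on row order
color_map = {"Yes": "red", "No": "green"}
print(color_map["Yes"], color_map["No"])
```

This is why an explicit map is the safer choice when a color has to keep its meaning across several plots.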

In [68]:
# Adding size as a variable to the scatter plot.
# The size of the data points is determined by the 'size' column in the dataset.
px.scatter(data_frame=tips,
           x="total_bill"
           ,y="tip",
            color="smoker",
            size="size",
            marginal_x="violin",
            marginal_y="box",
            color_discrete_sequence=["green","red"],
           title="total bill vs tip ")

### 2. Scatter Matrix

A scatter matrix visualizes pairwise relationships between multiple numerical columns and supports a color hue argument (for categorical columns).

In [69]:
# Create a scatter matrix for the 'tips' dataset (all columns are used when no dimensions are given).
px.scatter_matrix(tips)

### Customizing Scatter Matrix
We can adjust the size and choose specific columns for the scatter matrix.

In [70]:
# Change the dimensions of the scatter matrix (width and height)
px.scatter_matrix(tips,width=800,height=1000)


We can also choose which columns to include:

In [71]:
# Create a scatter matrix for selected columns: 'total_bill', 'tip', and 'size'
px.scatter_matrix(tips,dimensions=["total_bill","tip","size"],width=800,height=1000)
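
To see why the figure grows quickly with the number of columns, the plain-Python sketch below enumerates the panels of a scatter matrix for the three chosen dimensions. It only illustrates the grid layout and does not call Plotly; the variable names are illustrative.

```python
from itertools import product

dimensions = ["total_bill", "tip", "size"]

# Every (row variable, column variable) combination gets its own panel
panels = list(product(dimensions, repeat=2))

# Panels off the diagonal compare two different variables
off_diagonal = [(row, col) for row, col in panels if row != col]

print(len(panels))
print(len(off_diagonal))
```

With n dimensions the grid has n × n panels, so limiting `dimensions` to the columns of interest keeps the matrix readable.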


### Adding Hue to the Scatter Matrix
We can use color to differentiate based on a categorical variable like 'smoker'.

In [None]:
# Use the 'smoker' column to color the scatter matrix. Smokers will have a different color from non-smokers.
px.scatter_matrix(tips,dimensions=["total_bill","tip","size"],width=800,height=1000,color="smoker")


In [74]:
# Scale the marker size by the 'size' column (party size), keeping the color hue based on 'smoker' status.
px.scatter_matrix(tips,dimensions=["total_bill","tip","size"],width=800,height=1000
                  ,color="smoker",size="size"
                  )


In [75]:
# Add a title to the scatter matrix for clarity.
px.scatter_matrix(tips,dimensions=["total_bill","tip","size"],
                  width=800,
                  height=1000
                  ,color="smoker",
                  size="size",
                  title="scatter matrix of total_bill, tip and size"
                  )

### 3. Line Plots
Line plots help visualize trends between two continuous variables. 
In this section, we explore line plots between 'total_bill' and 'tip'.

In [76]:
# Simple line plot between 'total_bill' and 'tip'.
px.line(data_frame=tips,
        x="total_bill",
        y="tip")

### Sorting Data for Line Plot
px.line connects the points in row order, so unsorted data produces a zigzag that is hard to read.
Let's sort the data by 'total_bill' to see a clearer trend.

In [77]:
# Sort the data by 'total_bill' and then create the line plot
df=tips.sort_values(by="total_bill")
px.line(data_frame=df,x="total_bill",
        y="tip")
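
The effect of sorting can be quantified with a small plain-Python sketch: it measures how far a line has to travel along the x-axis when the points are connected in row order versus sorted order. The points are the first five rows of the dataset, and the helper `x_travel` is illustrative.

```python
# (total_bill, tip) pairs in their original row order
points = [(16.99, 1.01), (10.34, 1.66), (21.01, 3.50), (23.68, 3.31), (24.59, 3.61)]

def x_travel(pts):
    """Total horizontal distance covered when consecutive points are joined."""
    return sum(abs(b[0] - a[0]) for a, b in zip(pts, pts[1:]))

unsorted_travel = x_travel(points)

# Tuples sort by their first element, so this orders the points by total_bill
sorted_points = sorted(points)
sorted_travel = x_travel(sorted_points)

print(f"row order:    {unsorted_travel:.2f}")
print(f"sorted order: {sorted_travel:.2f}")
```

In sorted order the line only ever moves to the right, so the horizontal travel equals the range of `total_bill`; in row order it doubles back on itself.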

### Adding Color to Line Plot
We can differentiate the line plot based on smoker status by using different colors for smokers and non-smokers.


In [78]:
# Line plot with color distinction for smokers ('Yes') and non-smokers ('No')
px.line(data_frame=df,x="total_bill",
        y="tip",color="smoker")

### Customizing Line Plot
You can change the dimensions, title, and color scheme of the line plot to make it more visually appealing.

In [85]:
# Customize the line plot: set width, height, and title, and color the lines by smoker status with a custom color sequence (green and red)
fig=px.line(data_frame=df,x="total_bill",
        y="tip",
        color="smoker",
        color_discrete_sequence=["green","red"],
        width=1400,
        height=500,
        title="Tips variation per smoker and non_smoker for total bill")
fig.show()

In [86]:
# Save the interactive figure as a standalone HTML file
fig.write_html("Tips variation per smoker and non_smoker for total bill.html")

## Great Work!
You've successfully explored various plot types using Plotly. Keep practicing to become more familiar with customization options and creating compelling visualizations.
