# Final Project

**==========================================================================================================**


## Project Description / Business Task

In this guided final project, you will put your newly acquired knowledge and skills to the test by completing a comprehensive World Happiness data visualization project from start to finish. The project demonstrates your proficiency in Tableau: connecting to data sources, importing and cleaning data, and creating interactive visualizations, insightful stories, and interactive dashboards. Adhere to best practices in data visualization, storytelling, and report/dashboard design throughout the project. To showcase your work to your peers, share the link to your dashboard on Tableau Public.

## Data Dictionary

| Field                        | Description                                                                               |
|------------------------------|-------------------------------------------------------------------------------------------|
| country                      | The name of the country                                                                   |
| region                       | The geographic region or continent                                                        |
| happiness_score              | A measure reflecting overall happiness                                                    |
| gdp_per_capita               | The extent to which GDP contributes to the calculation of the Happiness Score             |
| social_support               | A metric measuring social support                                                         |
| healthy_life_expectancy      | The extent to which life expectancy contributes to the calculation of the Happiness Score |
| freedom_to_make_life_choices | The extent to which freedom contributes to the calculation of the Happiness Score         |
| generosity                   | The extent to which generosity contributes to the calculation of the Happiness Score      |
| year                         | The year of the observation                                                               |

**==========================================================================================================**

## Import Libraries

In [None]:
import numpy as np
#from numpy import count_nonzero, median, mean
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import random

# Import the classes directly; a bare `import datetime` would be shadowed by this line anyway.
from datetime import datetime, timedelta, date

# Use Folium library to plot values on a map.
import folium
from geopy.geocoders import Nominatim

%matplotlib inline
#sets the default autosave frequency in seconds
%autosave 60 
sns.set_style('dark')
sns.set(font_scale=1.2)
#sns.set(rc={'figure.figsize':(14,10)})

plt.rc('axes', titlesize=9)
plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns',None)
#pd.set_option('display.max_rows',None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format','{:.2f}'.format)

random.seed(0)
np.random.seed(0)
np.set_printoptions(suppress=True)

**==========================================================================================================**

## Import Data

In [None]:
df = pd.read_csv("WHR_15_23.csv")

In [None]:
df.info()

In [None]:
df.dtypes.value_counts()

In [None]:
# Descriptive Statistical Analysis
df.describe(include="all")

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
df.fillna(0, inplace=True)

In [None]:
df.duplicated().sum()

**==========================================================================================================**

# Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the crucial process of using summary statistics and graphical representations to perform preliminary investigations on data to uncover patterns, detect anomalies, test hypotheses, and verify assumptions.

**==========================================================================================================**

## Task 1 - Happiness status overview
Create a highlight table that shows the average happiness score for each region. Give it a relevant title and configure it to work as a filter for the dashboard.

In [None]:
happy = pd.DataFrame(df.groupby(["region"], as_index=True)["happiness_score"].mean())
happy

In [None]:
countryhappy = pd.DataFrame(df.groupby(["country"], as_index=False)["happiness_score"].mean())
countryhappy

In [None]:
# Sort the dataset based on happiness score
sorted_data = countryhappy.sort_values(by='happiness_score', ascending=False)
sorted_data

In [None]:
# Get the top 10 happiest countries
top_10 = sorted_data.head(10)
top_10

## Create a map to display the 10 happiest countries

In [None]:
# Initialize geocoder
geolocator = Nominatim(user_agent="happiness_map")

In [None]:
# Create a map centered on the world
m = folium.Map(location=[0, 0], zoom_start=2)
m;

In [None]:
# Add markers for the top 10 happiest countries
for index, row in top_10.iterrows():
    country = row['country']
    happiness_score = row['happiness_score']

    # Use geocoding to get latitude and longitude
    location = geolocator.geocode(country)
    if location is None:
        # Skip countries that the geocoder cannot resolve
        continue

    # Create marker at the geocoded coordinates
    marker = folium.Marker(location=[location.latitude, location.longitude],
                           popup=f"{country}: {happiness_score:.2f}")

    # Add marker to map
    marker.add_to(m)


In [None]:
# Bar chart of the top 10 happiest countries
top_10.plot(x="country", y="happiness_score", kind="bar", figsize=(12,5), fontsize=12)
plt.show()

In [None]:
df.groupby(['paymenttype1'], as_index=False)["duration"].mean()

In [None]:
df.groupby(['paymenttype1'])["duration"].mean().plot(kind="barh", figsize=(12,5), fontsize=12)
plt.show()

In [None]:
payment2 = df.groupby(['paymenttype2'], as_index=False)["duration"].mean()
payment2

In [None]:
# Select several columns with a list (double brackets); tuple-style selection was removed in pandas 2.0
df.groupby(["pulocationid237"])[['passengercount', 'tripdistance', 'fareamount', 'tipamount']].mean()

In [None]:
df.groupby(["pulocationid237"])[['passengercount', 'tripdistance', 'fareamount', 'tipamount']].mean(). \
plot(color=['red','blue','black','green'], fontsize=10.0, figsize=(10,5))
plt.show()

In [None]:
df.groupby(["dolocationid237"])[['passengercount', 'tripdistance', 'fareamount', 'tipamount']].mean()

In [None]:
df.groupby(['tpeppickupdatetimehour0', 'tpeppickupdatetimehour1', 'tpeppickupdatetimehour2'], as_index=False)\
[['passengercount', 'tripdistance', 'fareamount', 'tipamount']].mean()

In [None]:
df.groupby(['tpepdropoffdatetimehour0', 'tpepdropoffdatetimehour1', 'tpepdropoffdatetimehour2'], as_index=False)\
[['passengercount', 'tripdistance', 'fareamount', 'tipamount']].mean()

In [None]:
df.duration.describe()

In [None]:
plt.hist(x=df.duration, bins=20, range=(0,2500))
plt.show()

In [None]:
bins = [0, 400, 800, 1200, 1600, 2000, 2400]

In [None]:
cuts = pd.cut(x=df.duration, bins=bins, include_lowest=True)

In [None]:
df["durationgroup"] = cuts

In [None]:
df.head()

In [None]:
# numeric_only=True keeps non-numeric columns from raising an error in pandas 2.x
df.groupby(["durationgroup"]).mean(numeric_only=True)

In [None]:
df.groupby(['pulocationid237', 'pulocationid161', 'pulocationid236', 'pulocationid186', 'pulocationid162'], as_index=False) \
[['passengercount', 'tripdistance', 'fareamount', 'tipamount']].mean()

### Different Aggregates for Different Columns

In [None]:
df_ts = df.groupby("date", as_index=False).mean(numeric_only=True)
df_ts

In [None]:
df.groupby(["merchantstate"]).agg({"age": "min", "passengercount": "sum"})

In [None]:
df.groupby(["paymenttype1","paymenttype2"], as_index=False).agg(
    {"fareamount": [np.min,np.median,np.max], "tipamount": [np.min,np.median,np.max]})

In [None]:
df.groupby(['tpeppickupdatetimehour0', 'tpeppickupdatetimehour1', 'tpeppickupdatetimehour2'], as_index=False)\
[['fareamount', 'tipamount']].agg([np.mean, np.median])

In [None]:
df.groupby(['tpepdropoffdatetimehour0', 'tpepdropoffdatetimehour1', 'tpepdropoffdatetimehour2'], as_index=False)\
[['fareamount', 'tipamount']].agg([np.mean, np.median, np.min, np.max])

**==========================================================================================================**

## MultiIndexing

In [None]:
df.head()

In [None]:
df = df.set_index(["merchantstate","date"])
df = df.sort_index()
df

In [None]:
df.amount.unstack()

**==========================================================================================================**

## Pivot Tables

<p>Grouped data is much easier to read when it is reshaped into a pivot table. A pivot table is like an Excel spreadsheet, with one variable along the columns and another along the rows. We can convert the grouped dataframe to a pivot table using the "pivot" method.</p>
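The group-then-pivot step can be sketched on a tiny table. This is only an illustration: the `demo` frame and its `year`, `region`, and `score` columns are made up here and are not part of the project data.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only.
demo = pd.DataFrame({
    "year":   [2015, 2015, 2015, 2016, 2016, 2016],
    "region": ["A", "A", "B", "A", "B", "B"],
    "score":  [5.0, 6.0, 7.0, 5.5, 6.5, 7.5],
})

# Step 1: one row per (year, region) pair holding the mean score.
grouped = demo.groupby(["year", "region"], as_index=False)["score"].mean()

# Step 2: reshape so that years run down the rows and regions across the columns.
wide = grouped.pivot(index="year", columns="region", values="score")
print(wide)
```

`pivot` requires each (index, columns) pair to be unique, which is why the grouping step comes first; `pivot_table` combines both steps by aggregating duplicates itself.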

In [None]:
trnx = df.groupby(["date",'txndescription'], as_index=False).mean(numeric_only=True)
trnx

In [None]:
df_pivot = trnx.pivot(index='date', columns='txndescription', values="amount")
df_pivot

In [None]:
df.pivot_table(index="date", values="amount")

In [None]:
df.pivot_table(index="merchantstate", values="balance", aggfunc=[np.mean,np.median,np.sum,np.max])

In [None]:
df.pivot_table(index="date", columns="merchantstate", values="age")

In [None]:
df.pivot_table(index="date", columns="merchantstate", values="age", margins=True, margins_name="Sum")

In [None]:
df.pivot_table(index="date", columns=["merchantstate"], values=["amount","balance"], aggfunc=[np.mean], margins=True, 
               margins_name="Sum")

In [None]:
df.pivot_table(index="date", columns="merchantstate", values="amount", 
              aggfunc="count", dropna=False, fill_value=0)

## Crosstab

In [None]:
pd.crosstab(index=df["date"], columns=df["merchantstate"])

In [None]:
100 * pd.crosstab(index=df["date"], columns=df["txndescription"], margins=True, normalize=True)

## Melting

In [None]:
df_melt = df.pivot_table(index="date", columns="merchantstate", values="amount", 
              aggfunc="count", dropna=False, fill_value=0)

In [None]:
df_melt

In [None]:
df_melt.melt()

In [None]:
df_melt.reset_index()

In [None]:
df_melt.reset_index().melt()

In [None]:
df_melt2 = df_melt.reset_index().melt(id_vars="date", value_vars=None, var_name=None, value_name="Count")

In [None]:
df_melt2

**==========================================================================================================**

## Data Visualization

### Bar charts

As a first step, you will use **bar charts** to examine the frequency distributions of categorical variables. A bar chart displays the frequency of each category. In most cases, the categories should be **ordered by frequency**, either ascending or descending, because this ordering makes the chart easier to interpret.
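A minimal sketch of a frequency-ordered bar chart follows. The `regions` series is invented for illustration; with real data you would use a categorical column of your own dataframe.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical categorical values, made up for this example.
regions = pd.Series(
    ["Europe", "Asia", "Europe", "Africa", "Asia", "Europe", "Americas"],
    name="region",
)

# value_counts() returns the frequencies sorted in descending order by default.
counts = regions.value_counts()

ax = counts.plot.bar(figsize=(8, 4), rot=45)
ax.set(title="Number of rows per region", xlabel="Region", ylabel="Count")
plt.tight_layout()
plt.show()
```

Because `value_counts()` already sorts by frequency, no extra sorting step is needed before plotting.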

### Histograms

**Histograms** are related to bar plots. Whereas a bar plot shows the counts of unique categories, a histogram shows the **number of data values within a bin** for a **numeric variable**. The bins divide the range of the variable into equal-width segments. The vertical axis of the histogram shows the count of data values within each bin.
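The binning described above can be sketched with Matplotlib. The `scores` array is synthetic (drawn from a normal distribution) and merely stands in for any numeric column.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic numeric data, generated only for this illustration.
rng = np.random.default_rng(0)
scores = rng.normal(loc=5.5, scale=1.0, size=500)

fig, ax = plt.subplots(figsize=(8, 4))
# hist() returns the count in each bin and the bin edges (one more edge than bins).
counts, edges, _ = ax.hist(scores, bins=10)
ax.set(title="Histogram with 10 equal-width bins", xlabel="Value", ylabel="Count")
plt.show()

print(counts)          # number of values in each bin
print(np.diff(edges))  # bin widths, all equal
```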

### Kernel density plots

**Kernel density estimation** or **kde** plots are similar in concept to a histogram. A kernel density plot displays the values of a smoothed density curve of the data values. In other words, the kernel density plot is a smoothed version of a histogram.

### Combine histograms and kdes

Combining a histogram and a kde can highlight different aspects of a distribution. This is easy to do with Seaborn by passing `kde=True` to `histplot`. Increasing the number of bins, for example from 10 to 20, shows finer detail in the histogram.

## Two dimensional plots

Having used summary statistics and several one dimensional plot methods to explore data, you will continue this exploration using **two dimensional plots**. Two dimensional plots help you develop an understanding of the **relationship between two variables**. For machine learning, the relationship of greatest interest is between the **features** and the **label**. It can also be useful to examine the relationships between features to determine whether the features covary. Examining a plot can prove more reliable than simply computing a correlation when the relationship is nonlinear.

### Scatter Plots

Scatter plots are widely used to examine the relationship between two variables. With large datasets, however, many points land on top of one another and hide how dense the data really is, a problem known as overplotting.

Fortunately, there are several good ways to deal with overplotting:
1. Use **transparency** of the points to allow the viewer to see through points. With mild overplotting this approach can be quite effective.
2. **Contour plots** or **2d density plots** show the density of points, much as a topographic map shows elevation. Generating the contours has high computational complexity, making this method unsuitable for massive datasets.
3. **Hexbin plots** are the two-dimensional analog of a histogram. The density of the shading in the hexagonal cells indicates the density of points. Generating hexbins is computationally efficient and can be applied to massive datasets.
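Two of the remedies listed above, transparency and hexbins, can be sketched side by side. The data is synthetic, and the `alpha` and `gridsize` values are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, heavily overlapping points, generated only for this illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 0.6 * x + rng.normal(scale=0.8, size=5000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Transparency: dense regions appear darker because points stack up.
ax1.scatter(x, y, alpha=0.1, s=10)
ax1.set_title("Transparency (alpha=0.1)")

# Hexbin: a 2d histogram with hexagonal cells, shaded by the number of points.
hb = ax2.hexbin(x, y, gridsize=30, cmap="viridis")
ax2.set_title("Hexbin (gridsize=30)")
fig.colorbar(hb, ax=ax2, label="Points per cell")

plt.tight_layout()
plt.show()
```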

### Relation between categorical and numeric variables

You have created 2d plots of numeric variables. But what can you do if some of the features are categorical variables? There are two plot types specifically intended for this situation:
1. **Box plots**, which highlight the quartiles of a distribution. Not surprisingly, the box plot contains a box. The **inner two quartiles** are contained within the box, so the length of the box shows the **interquartile range**. A line within the box shows the median. **Whiskers** extend to the most extreme data value that lies within 1.5 times the interquartile range of the box. Outliers beyond the whiskers are shown with a symbol.
2. **Violin plots**, which are a variation on the 1d KDE plot. Two back-to-back KDE curves are used to show the density estimate.

Box plots or violin plots can be arranged side by side with the data of the numeric variable grouped by the categories of the categorical variable. In this way, each box or violin displays the distribution of the numeric variable for the cases in one category.
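A short sketch of grouped box and violin plots follows, together with a manual computation of the whisker limits for one group. The `demo` frame and its `group` and `value` columns are synthetic assumptions, not project data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data: one numeric variable measured in three made-up categories.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 100),
    "value": np.concatenate([
        rng.normal(5.0, 1.0, 100),
        rng.normal(6.0, 0.5, 100),
        rng.normal(4.0, 1.5, 100),
    ]),
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5), sharey=True)
sns.boxplot(data=demo, x="group", y="value", ax=ax1)
sns.violinplot(data=demo, x="group", y="value", ax=ax2)
ax1.set_title("Box plots")
ax2.set_title("Violin plots")
plt.show()

# Whisker limits for group A under the 1.5 * IQR rule.
a = demo.loc[demo["group"] == "A", "value"].to_numpy()
q1, q3 = np.percentile(a, [25, 75])
iqr = q3 - q1
upper_whisker = a[a <= q3 + 1.5 * iqr].max()
lower_whisker = a[a >= q1 - 1.5 * iqr].min()
```

Values of group A that fall outside `lower_whisker` and `upper_whisker` are the ones a box plot would draw as individual outlier symbols.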

### Use aesthetics to project additional dimensions

Up until now, you have worked with one or two variables on a single plot. But with complex datasets it is useful to view multiple dimensions on each plot. The question is, how can this be done when graphics displays are limited to two dimensions?

In this section, plot aesthetics are used to project additional dimensions. Some aesthetics are useful only for categorical variables, while others are useful for numeric variables. Keep in mind that not all plot aesthetics are equally effective. Tests of human perception have shown that people are very good at noticing small differences in position. This fact explains why scatter plots are so effective. In rough order of effectiveness these aesthetics are:
1. **Marker shape** is an effective indicator of variable category. It is critical to select shapes which are easily distinguished by the viewer.
2. **Marker size** shows values of a numeric variable. Be careful about whether a library's size parameter sets the span across the marker or its area; Matplotlib's `s` argument, for example, is an area in points squared.
3. **Marker color** is useful as an indicator of variable category. Color is the least effective of these three aesthetics in terms of human perception. Colors should be chosen to appear distinct. Additionally, keep in mind that many people, particularly men, are red-green color blind.

Categorical aesthetics, such as marker shape and color, are only effective if the differences in markers are perceptible. Using too many shapes or colors creates a situation where the viewer cannot tell the differences between the categories. Typically a limit of about five to seven categories should be observed.
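The three aesthetics can be combined in a single scatter plot, as in this sketch. The `demo` frame, its column names, and the shape and color mappings are all invented for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data: two positions, one numeric weight, one made-up category.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "x": rng.uniform(0, 10, 60),
    "y": rng.uniform(0, 10, 60),
    "weight": rng.uniform(1, 5, 60),
    "kind": rng.choice(["a", "b", "c"], 60),
})

# Arbitrary but easily distinguished shapes and colors, one per category.
markers = {"a": "o", "b": "s", "c": "^"}
colors = {"a": "tab:blue", "b": "tab:orange", "c": "tab:green"}

fig, ax = plt.subplots(figsize=(8, 6))
for kind, group in demo.groupby("kind"):
    ax.scatter(
        group["x"], group["y"],
        s=20 * group["weight"],   # marker size; Matplotlib's `s` is an area in points squared
        marker=markers[kind],     # marker shape encodes the category
        color=colors[kind],       # marker color encodes the same category
        alpha=0.7,
        label=kind,
    )
ax.legend(title="kind")
ax.set(xlabel="x", ylabel="y", title="Shape, size, and color as extra dimensions")
plt.show()
```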

### Color

As was already discussed, differences in color are hard for many people to perceive. Nonetheless, color is useful for projecting a limited number of categories of a variable. Choosing distinctive colors helps in this situation.

## Multi-axis views of data

Up to now, you have been working with plots with a single pair of axes. However, it is quite possible to create powerful data visualizations with multiple axes. These methods allow you to examine the relationships between many variables in one view. These multiple views aid in understanding the many relationships in complex datasets. There are a number of powerful multi-axes plot methods. In this lab you will work with two commonly applied methods:
1. **Pair-wise scatter plots** or **scatter plot matrices** are an array of scatter plots with common axes along the rows and columns of the array. The diagonal of the array can be used to display distribution plots. The cells above or below the diagonal can be used for other plot types like contour density plots.
2. **Conditioned plots**, **faceted plots** or **small multiple plots** use **group-by** operations to create and display subsets of the dataset. The display can be a one or two dimensional array organized by the groupings of the dataset.

### Pair-wise scatter plot

A scatter plot matrix can be created with the `pairplot` function from the Seaborn package. This function draws a scatter plot for each pair of numeric features in the dataset. Kernel density estimates of each variable can be displayed on the diagonal. Using the `map_upper` method, 2d density plots can be drawn in the cells above the diagonal.

# Data Visualization

## Matplotlib: Standard Python Visualization Library<a id="10"></a>

The primary plotting library we will explore in the course is Matplotlib.  As mentioned on their website:

> Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the jupyter notebook, web application servers, and four graphical user interface toolkits.

### Matplotlib.Pyplot

One of the core aspects of Matplotlib is `matplotlib.pyplot`. It is Matplotlib's scripting layer, which we studied in detail in the videos about Matplotlib. Recall that it is a collection of command style functions that make Matplotlib work like MATLAB. Each `pyplot` function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc. In this lab, we will work with the scripting layer to learn how to generate line plots. In future labs, we will get to work with the Artist layer as well to see firsthand how it differs from the scripting layer.
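The command style of the scripting layer can be sketched as follows, where each `pyplot` call changes the current figure. The yearly values are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up yearly values, used only to have something to draw.
years = np.arange(2015, 2024)
values = np.linspace(5.0, 5.8, len(years))

plt.figure(figsize=(8, 4))                      # create a figure
(line,) = plt.plot(years, values, marker="o")   # plot a line in the current plotting area
plt.title("Illustrative yearly average")        # decorate the plot
plt.xlabel("Year")
plt.ylabel("Value")
plt.show()
```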



## Seaborn Library

### 1. Numerical Data Plotting
- relplot()
- scatterplot()
- lineplot()

### 2. Categorical Data Plotting
- catplot()
- boxplot()
- stripplot()
- swarmplot()
- etc...

### 3. Visualizing Distribution of the Data
- histplot() / displot() (these replace the deprecated distplot())
- kdeplot()
- jointplot()
- rugplot()

### 4. Linear Regression and Relationship
- regplot()
- lmplot()

### 5. Controlling Plotted Figure Aesthetics
- figure styling
- axes styling
- color palettes
- etc..

## Subplots

Often we want to draw multiple plots within the same figure. For example, we might want to perform a side-by-side comparison of the box plot with the line plot of China and India's immigration.

To visualize multiple plots together, we can create a **`figure`** (overall canvas) and divide it into **`subplots`**, each containing a plot. With **subplots**, we usually work with the **artist layer** instead of the **scripting layer**.

Typical syntax is:

```python
    fig = plt.figure() # create figure
    ax = fig.add_subplot(nrows, ncols, plot_number) # create subplots
```

Where

*   `nrows` and `ncols` are used to notionally split the figure into (`nrows` \* `ncols`) sub-axes,
*   `plot_number` is used to identify the particular subplot that this function is to create within the notional grid. `plot_number` starts at 1, increments across rows first and has a maximum of `nrows` \* `ncols`.

In the case when `nrows`, `ncols`, and `plot_number` are all less than 10, a convenience exists such that a 3-digit number can be given instead, where the hundreds represent `nrows`, the tens represent `ncols` and the units represent `plot_number`. For instance,

```python
   subplot(211) == subplot(2, 1, 1) 
```

produces a subaxes in a figure which represents the top plot (i.e. the first) in a 2 rows by 1 column notional grid (no grid actually exists, but conceptually this is how the returned subplot has been positioned).
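Both calling conventions can be sketched in one figure: the three-argument form for the top subplot and the 3-digit shorthand for the bottom one. The plotted values are synthetic.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic values, generated only for this illustration.
rng = np.random.default_rng(0)
data = rng.normal(size=100)

fig = plt.figure(figsize=(8, 6))         # create figure
ax_top = fig.add_subplot(2, 1, 1)        # first cell of a 2-row, 1-column grid
ax_bottom = fig.add_subplot(212)         # shorthand for (2, 1, 2): the second cell

ax_top.boxplot(data, vert=False)
ax_top.set_title("Box plot")
ax_bottom.plot(data)
ax_bottom.set_title("Line plot")

fig.tight_layout()
plt.show()
```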

In [None]:
# Check plot styles
#plt.style.available

In [None]:
markers = ['o','.',',','x','+','v','^','<','>','s','d']

plt.figure(figsize=(10,10))

for m in markers:
    plt.plot(np.random.rand(5),np.random.rand(5),m,label=m)

plt.legend()
plt.show()

### Pandas version Scatter Matrix

In [None]:
scatter_matrix(frame=df2, figsize=(12,12), alpha=0.5, diagonal='kde')

plt.show()

## FacetGrid (Building structured multi-plot grids)

The FacetGrid class is useful when you want to visualize the distribution of a variable or the relationship between multiple variables separately within subsets of your dataset. A FacetGrid can be drawn with up to three dimensions: row, col, and hue. The first two have obvious correspondence with the resulting array of axes; think of the hue variable as a third dimension along a depth axis, where different levels are plotted with different colors.

Each of relplot(), displot(), catplot(), and lmplot() uses this object internally, and each returns the object when it is finished so that it can be used for further tweaking.

### Seaborn Version

In [None]:
df.columns

`
sns.FacetGrid(data,  row=None,  col=None,  hue=None,  col_wrap=None,
    sharex=True, sharey=True,  height=3,  aspect=1,  palette=None,  row_order=None,  col_order=None,
    hue_order=None,  hue_kws=None,  dropna=False, legend_out=True,  despine=True,
    margin_titles=False,  xlim=None,  ylim=None,  subplot_kws=None,  gridspec_kws=None, size=None)
`

In [None]:
sns.FacetGrid(data=df, col="label")

In [None]:
g = sns.FacetGrid(data=df, col="label", height=3, aspect=1)

g.map(sns.boxplot, "drives")

g.figure.suptitle("My super title", y=1.05)


plt.show()

In [None]:
g = sns.FacetGrid(data=df, col="gender", hue=None, col_wrap=None, height=3, aspect=2, margin_titles=True)
g.map(sns.histplot, "age")
g.add_legend()
plt.show()

In [None]:
g = sns.FacetGrid(data=df, col="merchantstate", hue=None, col_wrap=4, height=3, aspect=2, margin_titles=True)
g.map(sns.kdeplot, "amount", color="green")
g.add_legend()
plt.show()

In [None]:
g = sns.FacetGrid(data=df, col="merchantstate", row="gender", hue=None, col_wrap=None, height=3, aspect=2, margin_titles=True)
g.map(sns.regplot, "balance", "amount", color="red", fit_reg=True, x_jitter=None)
g.add_legend()
plt.show()

In [None]:
g = sns.FacetGrid(tips, col="day", height=4, aspect=.5)
g.map(sns.barplot, "sex", "total_bill", order=["Male", "Female"])
g.add_legend()
plt.show()

In [None]:
g = sns.FacetGrid(attend, col="subject", col_wrap=4, height=2, ylim=(0, 10))
g.map(sns.pointplot, "solutions", "score", order=[1, 2, 3], color=".3", errorbar=None)

In [None]:
g = sns.FacetGrid(tips, row="sex", col="smoker", margin_titles=True, height=2.5)
g.map(sns.scatterplot, "total_bill", "tip", color="#334488")
g.set_axis_labels("Total bill (US Dollars)", "Tip")
g.set(xticks=[10, 30, 50], yticks=[2, 6, 10])
g.figure.subplots_adjust(wspace=.02, hspace=.02)

**==========================================================================================================**

In [None]:
df.columns

In [None]:
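# numeric_only=True skips text columns such as country and region (required in pandas 2.x)
df_year = df.groupby(["year"], as_index=False).mean(numeric_only=True)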
df_year.head()

In [None]:
df_year["pop"] = df_year["population"]/1000000

In [None]:
df_year["gdpnorm"] = df_year["gdp"]/100000000

In [None]:
df2 = df_year[['infant_mortality', 'life_expectancy', 'fertility','pop', 'gdpnorm']]
df2.head()

**==========================================================================================================**

## Histogram

### Pandas Version

In [None]:
df.hist(bins=50, figsize=(20,80), layout=(len(df.columns),2), grid=False)
plt.suptitle('Histogram Feature Distribution', x=0.5, y=1.02, ha='center', fontsize=20)

plt.tight_layout()
plt.show()

In [None]:
df_year.plot.hist(figsize=(10, 10), subplots=True)
plt.show()

In [None]:
df2.plot.hist(figsize=(10, 10), bins=20, subplots=False, stacked = True, orientation = 'horizontal')
plt.show()

In [None]:
df.year.plot(kind = "hist", figsize = (12,5), fontsize = 15, bins = 80, density = True)
plt.show()

In [None]:
df.duration.hist(figsize = (12,8), bins = 80, xlabelsize=15, ylabelsize= 15, cumulative = True)
plt.show()

In [None]:
df.amount.plot.hist(figsize = (12,8))
plt.show()

In [None]:
#Plot 2 by 2 subplots

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2, sharex=False, sharey=False, figsize=(20,15))
fig.suptitle('Main Title', y=1.0, size = 20)

df.plot.hist(y="age", bins=10, color='darkblue', ax=ax1)
ax1.set_title("Title")
ax1.set(xlabel="x", ylabel="y")


df.plot.hist(y="balance", bins=10, ax=ax2)
ax2.set_title("Title")
ax2.set(xlabel="x", ylabel="y")

df.plot.hist(y="amount", bins=10, ax=ax3)
ax3.set_title("Title")
ax3.set(xlabel="x", ylabel="y")

df.plot.hist(y="age", bins=10, ax=ax4)
ax4.set_title("Title")
ax4.set(xlabel="x", ylabel="y")

plt.tight_layout()
plt.show()

### Matplotlib Version

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
ax.hist(x=df.amount, bins = 10, density=False, cumulative=False, color=None)
plt.show()

In [None]:
# plt.subplot(2,2,1)
# plt.subplot(2,2,2)
# plt.subplot(2,2,3)
# plt.subplot(2,2,4)
# plt.show()
# plt.tight_layout()

### Seaborn Version

In [None]:
fig, ax = plt.subplots(figsize=(12,5))

sns.histplot(x=df.amount, data=df, bins=10)

plt.show()

In [None]:
#Plot 2 by 2 subplots

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2, sharex=False, sharey=False, figsize=(20,15))
fig.suptitle('Main Title', y=1.0, size = 20)

sns.histplot(data=df, x=df.age, y=None, hue=None, ax=ax1)
ax1.set_title("Title")
ax1.set(xlabel="x", ylabel="y")


sns.histplot(data=df, x=df.age, y=None, hue=None, ax=ax2)
ax2.set_title("Title")
ax2.set(xlabel="x", ylabel="y")

sns.histplot(data=df, x=df.age, y=None, hue=None, ax=ax3)
ax3.set_title("Title")
ax3.set(xlabel="x", ylabel="y")

sns.histplot(data=df, x=df.age, y=None, hue=None, ax=ax4)
ax4.set_title("Title")
ax4.set(xlabel="x", ylabel="y")

plt.tight_layout()
plt.show()

In [None]:
# Create stacked histogram to compare department distribution of employees who left to that of employees who didn't
plt.figure(figsize=(11,8))
sns.histplot(data=df1, x='department', hue='left', discrete=1, 
             hue_order=[0, 1], multiple='dodge', shrink=.5)
plt.xticks(rotation=45)
plt.title('Counts of stayed/left by department', fontsize=14);

**==========================================================================================================**

## Bar Plots

### Pandas Version

In [None]:
df_year["infant_mortality"].plot(kind = "bar", figsize=(12,8))
#df_year["infant_mortaility"].plot.bar()
plt.show()

In [None]:
df_year.plot(x="year", y="fertility", kind ="bar", figsize=(12,8))
plt.show()

In [None]:
df2.plot.bar(stacked = True, figsize=(12,8))
plt.show()

In [None]:
df2.plot.barh(stacked = True, figsize=(12,8))
plt.show()

### Matplotlib Version

In [None]:
df.columns

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
ax.bar(x=df.txn_description, height=df.balance)
ax.set(title="title", xlabel="x", ylabel="y")

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
ax.barh(y=df.txn_description, width=df.balance)
ax.set(title="title", xlabel="x", ylabel="y")
ax.set_xlabel('xlabel', fontsize=15)
ax.set_ylabel('ylabel', fontsize=15)

plt.show()

### Seaborn Version

In [None]:
df.head(1)

In [None]:
sns.catplot(x="txndescription", y="amount", kind='bar', data=df, hue ='gender', aspect=2, height=6)
plt.show()

In [None]:
sns.catplot(x="txndescription", y="amount", kind='bar', data=df, hue ='gender', aspect=2, height=6, palette="viridis")
plt.show()

In [None]:
# Plot 4 rows and 1 column (can be expanded)

fig, ax = plt.subplots(4,1, sharex=False, figsize=(16,16))
fig.suptitle('Main Title', y=1.0)


sns.barplot(x="txndescription", y="amount", hue=None, errorbar=('ci', 95), data=df, orient=None, ax=ax[0])
ax[0].set_title('Title of the first chart')
#ax[0].tick_params('x', labelrotation=45)
ax[0].set_xlabel("")
ax[0].set_ylabel("")

sns.barplot(x="gender", y="amount", hue=None, errorbar=('ci', 95), data=df, orient=None, ax=ax[1])
ax[1].set_title('Title of the second chart')
#ax[1].tick_params('x', labelrotation=45)
ax[1].set_xlabel("")
ax[1].set_ylabel("")

sns.barplot(x="merchantstate", y="amount", hue=None, errorbar=('ci', 95), data=df, orient=None, ax=ax[2])
ax[2].set_title('Title of the third chart')
#ax[2].tick_params('x', labelrotation=45)
ax[2].set_xlabel("")
ax[2].set_ylabel("")

sns.barplot(x="movement", y="amount", hue=None, errorbar=('ci', 95), data=df, ax=ax[3])
ax[3].set_title('Title of the fourth chart')
#ax[3].tick_params('x', labelrotation=45)
ax[3].set_xlabel("")
ax[3].set_ylabel("")

plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.show()

In [None]:
# Sort Barplots by Values and Single Plot

fig = plt.figure(figsize=(20,10))


sns.barplot(x=None, y=df.vendorid2, data=df,
            order=df.sort_values('vendorid2', ascending=False).index)
plt.title("", size=20)
plt.xlabel("")
plt.ylabel("")
plt.xticks(rotation=90)
plt.ticklabel_format(style='plain', axis='y')

plt.tight_layout()
plt.show()

## Side by Side Bar Plots

In [None]:
df4 = pd.melt(df3, id_vars="cardID")
df4

In [None]:
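# factorplot was removed from Seaborn; catplot is its replacement.
# catplot is figure-level, so size it with height/aspect instead of plt.figure().
sns.catplot(x='a', y='b', hue="c", data=df4, kind='bar', height=5, aspect=4)
plt.show()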

## Horizontal/Vertical Stacked Column Bar Chart

In [None]:
# Create a stacked bar plot to visualize number of employees across department, comparing those who left with those who didn't
# In the legend, 0 (magenta) represents employees who did not leave, 1 (red) represents employees who left
pd.crosstab(df1["department"], df1["left"]).plot(kind='bar', stacked=True, color=['m', 'r'])
plt.title('Counts of employees who left versus stayed across department')
plt.ylabel('Employee count')
plt.xlabel('Department')
plt.show()

**==========================================================================================================**

## Scatter Plots

A `scatter plot` (2D) is a useful method of comparing variables against each other. `Scatter` plots look similar to `line plots` in that they both map independent and dependent variables on a 2D graph. While the data points are connected together by a line in a line plot, they are not connected in a scatter plot. The data in a scatter plot is considered to express a trend. With further analysis using tools like regression, we can mathematically calculate this relationship and use it to predict trends outside the dataset.
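The idea of quantifying a trend with regression can be sketched using a straight-line fit from `np.polyfit`. The data is synthetic, generated around a known line so the fitted coefficients can be compared with the true ones.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic points scattered around the line y = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=100)

# Least-squares fit of a degree-1 polynomial: returns slope, then intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Use the fitted line to predict a value outside the observed x range.
x_new = 12.0
y_pred = slope * x_new + intercept

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(x, y, alpha=0.6, label="data")
xs = np.linspace(0, 12, 50)
ax.plot(xs, slope * xs + intercept, color="red", label="fitted line")
ax.set(xlabel="x", ylabel="y", title="Scatter plot with a fitted trend line")
ax.legend()
plt.show()
```

Predictions far outside the observed range rely on the linear trend continuing to hold, which the scatter plot alone cannot confirm.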

### Pandas Version

In [None]:
df2.head()

In [None]:
df2.plot.scatter(x="infant_mortality" , y="life_expectancy", c=None, figsize=(12,8))

plt.show()

In [None]:
df2.plot.scatter(x="infant_mortality" , y="life_expectancy", c="fertility", figsize=(12,8))

plt.show()

In [None]:
ax = df2.plot.scatter(x="infant_mortality" , y="life_expectancy", figsize=(12,8))
df2.plot.scatter(x="infant_mortality" , y="pop", ax=ax, color='r')
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
df.plot(kind="scatter", x="amount", y="balance", color=(0.0, 0.0, 0.8), label='Inline label', ax=ax)

ax.set(title="title",
       xlabel="x",
       ylabel="y")

ax.axhline(y=100000, linewidth=3, color="red", linestyle="--")
ax.legend()
ax.set_xlim([0, 2000])
ax.set_ylim([0, 150000])
plt.show()

In [None]:
#Plot 2 by 2 subplots

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2, sharex=False, sharey=False, figsize=(20,15))
fig.suptitle('Main Title', y=1.0, size = 20)

df.plot.scatter(x="age", y="amount", s=None, c='b', ax=ax1)
ax1.set_title("Title")
ax1.set(xlabel="x", ylabel="y")
#ax1.set_xlim([0, 0])
#ax1.set_ylim([0, 0])

df.plot.scatter(x="age", y="amount", s=None, c='b', ax=ax2)
ax2.set_title("Title")
ax2.set(xlabel="x", ylabel="y")

df.plot.scatter(x="age", y="amount", s=None, c='b', ax=ax3)
ax3.set_title("Title")
ax3.set(xlabel="x", ylabel="y")

df.plot.scatter(x="age", y="amount", s=None, c='b', ax=ax4)
ax4.set_title("Title")
ax4.set(xlabel="x", ylabel="y")

plt.tight_layout()
plt.show()

### Matplotlib Version

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(x=df.date, y=df.amount, s=20, alpha=1.0)  # cmap only takes effect together with c=<numeric values>
plt.show()
#plt.colorbar()

In [None]:
df.columns

### Seaborn Version

In [None]:
#Plot 2 by 2 subplots

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2, sharex=False, sharey=False, figsize=(20,15))
fig.suptitle('Main Title', y=1.0, size = 20)


sns.scatterplot(x="age", y="amount", data=df, color='darkblue', s=200, ax=ax1)
ax1.set_title("Title", size = 20)
ax1.set(xlabel="x", ylabel="y")
#ax1.set_xlim([0, 0])
#ax1.set_ylim([0, 0])

sns.scatterplot(x="age", y="amount", data=df, color='darkblue', s=200, ax=ax2)
ax2.set_title("Title", size = 20)
ax2.set(xlabel="x", ylabel="y")

sns.scatterplot(x="age", y="amount", data=df, color='darkblue', s=200, ax=ax3)
ax3.set_title("Title", size = 20)
ax3.set(xlabel="x", ylabel="y")

sns.scatterplot(x="age", y="amount", data=df, color='darkblue', s=200, ax=ax4)
ax4.set_title("Title", size = 20)
ax4.set(xlabel="x", ylabel="y")


plt.tight_layout()
plt.show()

In [None]:
# Create scatterplot of `average_monthly_hours` versus `satisfaction_level`, comparing employees who stayed versus those who left
plt.figure(figsize=(16, 9))
sns.scatterplot(data=df1, x='average_monthly_hours', y='satisfaction_level', hue='left', alpha=0.4)
plt.axvline(x=166.67, color='#ff6361', label='166.67 hrs./mo.', ls='--')
plt.legend(labels=['166.67 hrs./mo.', 'left', 'stayed'])
plt.title('Monthly hours by satisfaction level', fontsize='14')
plt.show()

In [None]:
# Create plot to examine relationship between `average_monthly_hours` and `promotion_last_5years`
plt.figure(figsize=(16, 3))
sns.scatterplot(data=df1, x='average_monthly_hours', y='promotion_last_5years', hue='left', alpha=0.4)
plt.axvline(x=166.67, color='#ff6361', ls='--')
plt.legend(labels=['166.67 hrs./mo.', 'left', 'stayed'])
plt.title('Monthly hours by promotion last 5 years', fontsize='14');

**==========================================================================================================**

## Rel Plot

To make complex comparisons easier, Seaborn provides a function called `relplot`, short for relationship plot. `relplot` takes some of the same arguments as `scatterplot`, such as `data`, `x`, `y`, and `hue`, and adds others such as `col`, `row`, and `kind`.

`sns.relplot` has the parameters `height` and `aspect`, which control the dimensions of each subplot it generates: `height` is the height of a subplot in inches, and its width is `aspect * height`.

**Splitting the Figure into Subplots / Facets**

When several categories are drawn on the same axes, the points lie on top of each other and become hard to tell apart. One way around this is to create multiple subplots (facets), one for each category.
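
The sizing rule and the faceting idea can be sketched together on a tiny made-up dataset. The dataframe `toy` and its columns are assumptions for illustration only; the `relplot` arguments are the ones described above.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data with two categories (hypothetical, for illustration only)
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 1, 2, 3, 4],
    "y": [2, 4, 5, 8, 1, 3, 3, 6],
    "group": ["a"] * 4 + ["b"] * 4,
})

# One facet per value of `group`; each facet is 3 in tall and 1.5 * 3 = 4.5 in wide
g = sns.relplot(data=toy, x="x", y="y", col="group", height=3, aspect=1.5)

n_facets = len(g.axes.flat)
width, height = g.figure.get_size_inches()
print(n_facets, width, height)
plt.show()
```

Because `relplot` is figure-level, the overall figure size follows from the number of facets, so `height` and `aspect` are the parameters to adjust rather than `figsize`.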

### Matplotlib Version

### Seaborn Version

In [None]:
df.columns

In [None]:
sns.relplot(x=df.age, y=df.amount, data=df, hue=df.gender, size=None, style=df.gender, col=None, height=5, aspect=2)
plt.show()

In [None]:
sns.relplot(x=df.age, y=df.amount, data=df, kind="line", height=5, aspect=2, ci=None)
plt.show()

### Relplot subplots

In [None]:
sns.relplot(x="age", y="amount", data=df, height=5, aspect=2, col="txndescription", col_wrap=2)
plt.show()

In [None]:
sns.relplot(x="age", y="balance", data=df, height=5, aspect=2, col="movement", col_wrap=2)
plt.show()

In [None]:
sns.relplot(x="age", y="balance", data=df, height=5, aspect=2, col="movement", col_wrap=2, hue="gender")
plt.show()

**==========================================================================================================**

## Line Plots

**What is a line plot and why use it?**

A line chart or line plot displays information as a series of data points, called markers, connected by straight line segments. It is a basic chart type common in many fields.
Use a line plot when you have a continuous dataset; it is best suited to showing trends over a period of time.

A line plot is also a handy way to display several dependent variables against one independent variable. However, keep it to no more than 5-10 lines on a single graph; beyond that, the plot becomes difficult to interpret.
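
A minimal sketch of plotting several dependent variables against one independent variable, using a small made-up dataframe (the column names and values are hypothetical): `DataFrame.plot` draws one line per column against the index.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly values for three hypothetical series sharing one x-axis
toy = pd.DataFrame(
    {
        "series_a": [1.0, 1.5, 2.1, 2.8],
        "series_b": [2.0, 1.8, 1.7, 1.5],
        "series_c": [0.5, 0.9, 1.6, 2.4],
    },
    index=pd.Index([2019, 2020, 2021, 2022], name="year"),
)

# One line per column, all plotted against the shared index
ax = toy.plot(figsize=(8, 4), marker="o", title="Three series against one index")
ax.set_ylabel("value")

n_lines = len(ax.get_lines())
print(n_lines)
plt.show()
```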

### Pandas Method

In [None]:
df.plot(subplots = True, figsize=(15, 30), sharex=False, sharey=False)

plt.tight_layout()
plt.show()

In [None]:
df2.plot(subplots=True, sharex = False, layout = (3, 2), figsize = (12, 8))

plt.tight_layout()
plt.show()

In [None]:
df.head()

In [None]:
df.year.plot(subplots=True, figsize=(15, 5), legend = False, title="title")
plt.show()

In [None]:
x = df.infant_mortality
ax = x.plot()
df.year.plot(subplots=False, figsize=(15, 5), secondary_y= True, legend = True, title="title")
plt.show()

In [None]:
df_ts.age.plot(figsize=(12, 5), fontsize= 13, c = "darkblue")
plt.title("Title", fontsize = 15)
plt.legend(loc = 3, fontsize = 15)
plt.xlabel("x", fontsize = 13)
plt.ylabel("y", fontsize = 13)
#plt.text(1000, 2600, 'Insert Text')
#plt.grid()
plt.show()

### Matplotlib Version

In [None]:
fig, ax = plt.subplots(figsize=(12,5))
ax.plot(df_ts.index, df_ts.amount, c="blue", linewidth=2, linestyle = "--", markersize=5)  # x = index, y = amount
#ax.set(title="Plot", xlabel="x", ylabel="y")

ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_xlim()
ax.set_ylim()
ax.set_title("Plot")
ax.legend(["legend"], loc='best')
plt.show()

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(12,8))

ax[0,0].plot(df_ts.balance)
ax[0,0].set_xlabel("x")
ax[0,0].set_ylabel("y")
ax[0,0].set_title("title")

ax[0,1].plot(df_ts.age)
ax[0,1].set_xlabel("x")
ax[0,1].set_ylabel("y")
ax[0,1].set_title("title")

ax[1,0].plot(df_ts.balance)
ax[1,0].set_xlabel("x")
ax[1,0].set_ylabel("y")
ax[1,0].set_title("title")

ax[1,1].plot(df_ts.amount)
ax[1,1].set_xlabel("x")
ax[1,1].set_ylabel("y")
ax[1,1].set_title("title")

plt.tight_layout()
plt.show()

In [None]:
# Save Image
# fig.savefig()

### Seaborn Version

In [None]:
fig, ((ax1, ax2, ax3)) = plt.subplots(nrows=3, ncols=1, figsize=(12,8))

sns.lineplot(x ='date', y ='amount', style=None, hue=None, data=df_ts, markers=True, ci=95, ax=ax1)
ax1.set_title("Title", size = 20)
ax1.set(xlabel="x", ylabel="y")
#ax1.set_xlim([0, 0])
#ax1.set_ylim([0, 0])

sns.lineplot(x ='date', y ='amount', style=None, hue=None, data=df_ts, markers=True, ci=95, ax=ax2)
ax2.set_title("Title", size = 20)
ax2.set(xlabel="x", ylabel="y")
#ax2.set_xlim([0, 0])
#ax2.set_ylim([0, 0])

sns.lineplot(x ='date', y ='amount', style=None, hue=None, data=df_ts, markers=True, ci=95, ax=ax3)
ax3.set_title("Title", size = 20)
ax3.set(xlabel="x", ylabel="y")
#ax3.set_xlim([0, 0])
#ax3.set_ylim([0, 0])

plt.tight_layout()
plt.show()

In [None]:
fig = plt.figure(figsize=(30,10))
sns.lineplot(x=df_ts.index, y=df_ts.amount, data=df_ts, estimator='mean')
plt.title("", fontsize=20)
plt.xlabel("", fontsize=20)
plt.ylabel("", fontsize=20)
plt.legend(['',''])
plt.show()

In [None]:
fig = plt.figure(figsize=(30,10))
sns.lineplot(x=df.month,y=df.amount,data=df, estimator=None)
plt.title("", fontsize=20)
plt.xlabel("", fontsize=20)
plt.ylabel("", fontsize=20)
plt.legend(['',''])
plt.show()

**==========================================================================================================**

## Point Plot 

In [None]:
df.head(1)

In [None]:
sns.catplot(x ='gender', y='age', hue=None, kind='point', data=df)
plt.show()

**==========================================================================================================**

## Box Plots

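A `box plot` is a way of statistically representing the *distribution* of the data through five summary statistics:

*   **Minimum:** The smallest value in the dataset excluding outliers (the end of the lower whisker).
*   **First quartile:** The 25th percentile, i.e. the median of the lower half of the sorted data.
*   **Second quartile (Median):** The middle value of the sorted dataset.
*   **Third quartile:** The 75th percentile, i.e. the median of the upper half of the sorted data.
*   **Maximum:** The largest value in the dataset excluding outliers (the end of the upper whisker).

By default, Matplotlib and Seaborn treat points more than 1.5 times the interquartile range (third quartile minus first quartile) below the first quartile or above the third quartile as outliers and draw them individually.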
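
The five statistics can be computed by hand to see what a box plot draws. This is a minimal sketch on made-up numbers, assuming the common 1.5 × IQR rule for outliers and NumPy's default (linear interpolation) percentile method; other quartile conventions give slightly different values.

```python
import numpy as np

# Made-up sample; 30 is an obvious outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Quartiles (NumPy's default linear interpolation)
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: points beyond these are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme values that are still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)             # 3.25 5.5 7.75
print(whisker_low, whisker_high)  # 1 9
print(outliers)                   # [30]
```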

### Pandas Version

In [None]:
df.boxplot(figsize=(20,10), color='blue', fontsize=15, grid=False)
plt.suptitle('BoxPlots Feature Distribution', x=0.5, y=1.02, ha='center', fontsize=20)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
df2.plot(kind='box', figsize=(12, 5), subplots=False)

plt.suptitle('title', size = 15)
plt.ylabel('y')

plt.show()

In [None]:
# horizontal box plots
df2.plot(kind='box', figsize=(10, 7), color='blue', vert=False)

plt.title('Title', size = 15)
plt.xlabel('x')

plt.show()

### Matplotlib Version

In [None]:
fig, ax = plt.subplots(figsize=(12,5))
ax.boxplot(df.amount, vert=True)
ax.set(title="title", xlabel="x", ylabel="y")

ax.legend(["legend"])
plt.show()

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(12,8))

ax[0,0].boxplot(df.amount, vert=False)
ax[0,0].set_xlabel("x")
ax[0,0].set_ylabel("y")
ax[0,0].set_title("title")

ax[0,1].boxplot(df.amount, vert=False)
ax[0,1].set_xlabel("x")
ax[0,1].set_ylabel("y")
ax[0,1].set_title("title")

ax[1,0].boxplot(df.amount, vert=False)
ax[1,0].set_xlabel("x")
ax[1,0].set_ylabel("y")
ax[1,0].set_title("title")

ax[1,1].boxplot(df.amount, vert=False)
ax[1,1].set_xlabel("x")
ax[1,1].set_ylabel("y")
ax[1,1].set_title("title")

plt.tight_layout()
plt.show()

### Seaborn Version

In [None]:
# Create boxplot to visualize the outliers

g = sns.boxplot(data=df[["passenger_count","tip_amount","total_amount", "trip_duration"]], showfliers=True);
g.set_title("4 Variables with Outliers",fontsize=20)

In [None]:
# Create boxplot to visualize distribution of data without outliers

g = sns.boxplot(data=df[["passenger_count","tip_amount","total_amount", "trip_duration"]], showfliers=False);
g.set_title("4 Variables without Outliers",fontsize=20)

In [None]:
#Plot 2 by 2 subplots

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2, sharex=False, figsize=(20,20))
fig.suptitle('Main Title', y=1.0)

sns.boxplot(x="cyl", y="mpg", data=df, ax=ax1)
ax1.set_title('Title of the first chart', size=20)
#ax1.tick_params('x', labelrotation=45)
ax1.set_xlabel("")
ax1.set_ylabel("")

sns.boxplot(x="gear", y="mpg", data=df, ax=ax2)
ax2.set_title('Title of the second chart', size=20)
#ax2.tick_params('x', labelrotation=45)
ax2.set_xlabel("")
ax2.set_ylabel("")

sns.boxplot(x="carb", y="mpg", data=df, ax=ax3)
ax3.set_title('Title of the third chart', size=20)
#ax3.tick_params('x', labelrotation=45)
ax3.set_xlabel("")
ax3.set_ylabel("")

sns.boxplot(x="", y="", data=df, ax=ax4)
ax4.set_title('Title of the fourth chart', size=20)
#ax4.tick_params('x', labelrotation=45)
ax4.set_xlabel("")
ax4.set_ylabel("")

plt.tight_layout()
plt.show()

In [None]:
# Create a boxplot to visualize distribution of `tenure` and detect any outliers
plt.figure(figsize=(6,6))
plt.title('Boxplot to detect outliers for tenure', fontsize=12)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
sns.boxplot(x=df1['tenure'])
plt.show()

In [None]:
# Set figure and axes
fig, ax = plt.subplots(1, 2, figsize = (22,8))

# Create boxplot showing `average_monthly_hours` distributions for `number_project`, comparing employees who stayed versus those who left
sns.boxplot(data=df1, x='average_monthly_hours', y='number_project', hue='left', orient="h", ax=ax[0])
ax[0].invert_yaxis()
ax[0].set_title('Monthly hours by number of projects', fontsize='14')

# Create histogram showing distribution of `number_project`, comparing employees who stayed versus those who left
tenure_stay = df1[df1['left']==0]['number_project']
tenure_left = df1[df1['left']==1]['number_project']
sns.histplot(data=df1, x='number_project', hue='left', multiple='dodge', shrink=2, ax=ax[1])
ax[1].set_title('Number of projects histogram', fontsize='14')

# Display the plots
plt.show()

**==========================================================================================================**

## Boxen Plot

### Seaborn Version

In [None]:
df.columns

In [None]:
sns.catplot(x='txndescription', y='amount', kind='boxen', data=df, ci=95, height=6, aspect=2)
plt.show()

In [None]:
sns.catplot(x='txndescription', y='amount', kind='boxen', hue='gender', data=df, ci=95, height=6, aspect=2)
plt.show()

**==========================================================================================================**

## Violin Plot

In [None]:
fig, ax = plt.subplots(figsize=(12,8))

sns.violinplot(x="movement", y="balance", hue="gender", data=df)

ax.set_title('Title', size=15)
ax.tick_params('x', labelrotation=45)
ax.set_xlabel("")
ax.set_ylabel("")
#ax.legend()

plt.show()

In [None]:
sns.catplot(x='balance', y='movement', data=df, kind='violin', hue='gender', split = True, height=6, aspect=2)
plt.show()

**==========================================================================================================**

## Count Plots

### Matplotlib Version

### Seaborn Version

In [None]:
df.select_dtypes(include="object")

In [None]:
fig, ax = plt.subplots(figsize=(12,8))

sns.countplot(x="txndescription", data=df)
ax.set_title('Title', size=15)
ax.tick_params('x', labelrotation=45)
ax.set_xlabel("")
ax.set_ylabel("")
#ax.legend()

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12,8))

sns.countplot(x="gender", data=df)
ax.set_title('Title', size=15)
ax.tick_params('x', labelrotation=45)
ax.set_xlabel("")
ax.set_ylabel("")
#ax.legend()

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12,8))

sns.countplot(x="merchantstate", hue="movement", data=df)
ax.set_title('Title', size=15)
ax.tick_params('x', labelrotation=45)
ax.set_xlabel("")
ax.set_ylabel("")
#ax.legend()

plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(12,8))

sns.countplot(x="merchantsuburb", data=df)
ax.set_title('Title', size=15)
ax.tick_params('x', labelrotation=45)
ax.set_xlabel("")
ax.set_ylabel("")
#ax.legend()

plt.show()

In [None]:
#Plot 2 by 2 subplots

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2, sharex=False, figsize=(20,20))
fig.suptitle('Main Title', y=1.0)

sns.countplot(x="", data=df, ax=ax1)
ax1.set_title('Title of the first chart', size=20)
#ax1.tick_params('x', labelrotation=45)
ax1.set_xlabel("")
ax1.set_ylabel("")

sns.countplot(x="", data=df, ax=ax2)
ax2.set_title('Title of the second chart', size=20)
#ax2.tick_params('x', labelrotation=45)
ax2.set_xlabel("")
ax2.set_ylabel("")

sns.countplot(x="", data=df, ax=ax3)
ax3.set_title('Title of the third chart', size=20)
#ax3.tick_params('x', labelrotation=45)
ax3.set_xlabel("")
ax3.set_ylabel("")

sns.countplot(x="", data=df, ax=ax4)
ax4.set_title('Title of the fourth chart', size=20)
#ax4.tick_params('x', labelrotation=45)
ax4.set_xlabel("")
ax4.set_ylabel("")

plt.tight_layout()
plt.show()

**==========================================================================================================**

## Strip Plot

In [None]:
df.head(1)

In [None]:
fig, ax = plt.subplots(figsize=(12,8))

sns.stripplot(x="movement", y="balance", hue="gender", data=df, jitter=True)

ax.set_title('Title', size=15)
ax.tick_params('x', labelrotation=45)
ax.set_xlabel("")
ax.set_ylabel("")
#ax.legend()

plt.show()

**==========================================================================================================**

## Swarm Plot

In [None]:
fig, ax = plt.subplots(figsize=(12,8))

sns.swarmplot(x="movement", y="age", hue="gender", data=df)

ax.set_title('Title', size=15)
ax.tick_params('x', labelrotation=45)
ax.set_xlabel("")
ax.set_ylabel("")
#ax.legend()

plt.show()

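**==========================================================================================================**

## Category Plots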

### Matplotlib Version

### Seaborn Version

In [None]:
df.head(1)

In [None]:
sns.catplot(x="movement", y="age", data=df, height=8, aspect=2, ci=None, kind="strip")
plt.xlabel("x",fontsize=20)
plt.ylabel("y",fontsize=20)
plt.show()

In [None]:
sns.catplot(x="movement", y="age", data=df, height=8, aspect=2, ci=None, kind="swarm")
plt.xlabel("x",fontsize=20)
plt.ylabel("y",fontsize=20)
plt.show()

In [None]:
sns.catplot(x="movement", y="age", data=df, height=8, aspect=2, ci=None, kind="boxen")
plt.xlabel("x",fontsize=20)
plt.ylabel("y",fontsize=20)
plt.show()

In [None]:
sns.catplot(x="movement", y="age", hue="gender", data=df, height=8, aspect=2, ci=None, kind="box")
plt.xlabel("x",fontsize=20)
plt.ylabel("y",fontsize=20)
plt.show()

In [None]:
sns.catplot(x="age", y="amount", hue="gender", data=df, height=8, aspect=2, ci=None, kind="box")
plt.xlabel("x",fontsize=20)
plt.xlim(0, 7)
plt.ylabel("y",fontsize=20)
plt.show()

In [None]:
# catplot is figure-level and creates its own figure; size it with `height` and `aspect`


g = sns.catplot(x='', hue = '', row = '', kind='count', data=df, height = 3, aspect = 1)

g.set_xlabels("x")
g.set_ylabels("y")
g.set(xlim=(0,10))
g.set(ylim=(0,100))



g.set_xticklabels(rotation=90)
g.set_yticklabels(rotation=90)


plt.suptitle('', x=0.5, y=1.02, ha='center', fontsize=20)

plt.tight_layout()
plt.show()

In [None]:
# catplot is figure-level and creates its own figure; size it with `height` and `aspect`
sns.catplot(x="calories", y="restaurant",
            hue="is_salad", ci=None,
            data=df_calories, color=None, linewidth=3, showfliers=False,
            orient="h", height=20, aspect=1, palette=None,
            kind="box", dodge=True)

plt.xlabel("", size=20)
plt.ylabel("", size=20)
plt.suptitle('', x=0.5, y=1.02, ha='center', fontsize=20)

plt.tight_layout()
plt.show()

**==========================================================================================================**

## Joint Plot

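`jointplot` is a figure-level function: each call creates its own figure, so it cannot be placed directly into an existing grid of subplots. The cells below therefore draw the joint plots one after another.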

### Matplotlib Version

### Seaborn Version

In [None]:
df.head()

In [None]:
# Stack Jointplots in long format

sns.jointplot(x='age', y='amount', data=df, kind='scatter')

sns.jointplot(x='age', y='amount', data=df, kind='kde')

sns.jointplot(x='age', y='amount', data=df, kind='hist')

sns.jointplot(x='age', y='amount', data=df, kind='hex')

sns.jointplot(x='age', y='amount', data=df, kind='reg')

sns.jointplot(x='age', y='amount', data=df, kind='resid')

plt.show()

In [None]:
sns.jointplot(x='', y='',data=df, kind='reg',scatter_kws={'color':'k'},line_kws={'color':'red'})

sns.jointplot(x='', y='',data=df, kind='reg',scatter_kws={'color':'k'},line_kws={'color':'red'})

sns.lmplot(x='num_items', y='total_value', data=df, scatter_kws={'s': 1, 'alpha': 0.1}, height=5, aspect=1,
           line_kws={'lw': 2, 'color': 'red'})

sns.lmplot(x='num_items', y='total_value', data=df, scatter_kws={'s': 1, 'alpha': 0.1}, height=5, aspect=1,
           line_kws={'lw': 2, 'color': 'red'})

plt.tight_layout()
plt.show()

**==========================================================================================================**

## Heatmap

### Matplotlib Version

### Seaborn Version

In [None]:
df.head()

In [None]:
plt.figure(figsize=(16,9))
# numeric_only=True restricts the correlation matrix to numeric columns
sns.heatmap(data=df.corr(numeric_only=True), cmap="coolwarm", annot=True, fmt='.2f', linewidths=2)
plt.title("Correlation Heatmap", fontsize=20)
plt.show()

**==========================================================================================================**

## Regression plot

### Matplotlib Version

### Seaborn Version

In [None]:
# Plot 1 rows and 2 columns (can be expanded)
line_color = {'color': 'red'}
fig, ax = plt.subplots(1,2, sharex=False, figsize=(16,5))
fig.suptitle('Regression Plots')

sns.regplot(x=df.wt, y=df.mpg, data=df, ax=ax[0], ci=None, line_kws=line_color)
ax[0].set_title('title')
#ax[0].tick_params('x', labelrotation=45)
ax[0].set_xlabel("x")
ax[0].set_ylabel("y")

sns.regplot(x=df.hum, y=df.cnt, data=df, ax=ax[1], ci=None, line_kws=line_color)
ax[1].set_title('title')
#ax[1].tick_params('x', labelrotation=45)
ax[1].set_xlabel("x")
ax[1].set_ylabel("y")

plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(15, 10))
sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+')
plt.show()

In [None]:
plt.figure(figsize=(15, 10))
ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+', scatter_kws={'s': 200})

ax.set(xlabel='Year', ylabel='Total Immigration') # add x- and y-labels
ax.set_title('Total Immigration to Canada from 1980 - 2013') # add title
plt.show()

In [None]:
line_color = {'color': 'red'}
fig , ax = plt.subplots(2,2, figsize=(20,20))

#Feature

ax1 = sns.regplot(x=X_test.bmi, y=lr_pred, line_kws=line_color, ax=ax[0,0])
ax1.set_xlabel("x")
ax1.set_ylabel("y")
ax1.set_title("Plot 1", size=15)

#Feature

ax2 = sns.regplot(x=X_test.s5, y=lr_pred, line_kws=line_color, ax=ax[0,1])
ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.set_title("Plot 2", size=15)

#Feature

ax3 = sns.regplot(x=X_test.bp, y=lr_pred, line_kws=line_color, ax=ax[1,0])
ax3.set_xlabel("x")
ax3.set_ylabel("y")
ax3.set_title("Plot 3", size=15)

#Feature

ax4 = sns.regplot(x=X_test.s4, y=lr_pred, line_kws=line_color, ax=ax[1,1])
ax4.set_xlabel("x")
ax4.set_ylabel("y")
ax4.set_title("Plot 4", size=15)
plt.suptitle('Regression Plots', x=0.5, y=0.9, ha='center', fontsize=20)
plt.show()

**==========================================================================================================**

## Pairplots

### Matplotlib Version

### Seaborn Version

In [None]:
df.head(1)

In [None]:
df.columns

In [None]:
# Take only continuous variables

df_cont = df[['age', 'balance', 'amount']]

In [None]:
sns.pairplot(df_cont.sample(300), height=4, aspect=1)
plt.suptitle('Pairplots of features', x=0.5, y=1.02, ha='center', fontsize=20)

plt.show()

In [None]:
# Compare to target variable

# pairplot is figure-level: create it first, then add the title to its own figure
g = sns.pairplot(df.sample(300), x_vars=['tripdistance', 'fareamount', 'tipamount',
                 'totalamount'], y_vars=["duration"])
g.figure.suptitle('Pairplots of features', x=0.5, y=1.02, ha='center', fontsize=20)
plt.show()

In [None]:
# For small sample size
# pairplot is figure-level: create it first, then add the title to its own figure
g = sns.pairplot(df)
g.figure.suptitle('Pairplots of features', x=0.5, y=1.02, ha='center', fontsize=20)
plt.show()

**==========================================================================================================**

## PairGrid

In [None]:
g = sns.PairGrid(df_cont.sample(300))
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot, n_levels = 10)

plt.show()

In [None]:
g = sns.PairGrid(df_cont.sample(300))
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.scatterplot)

plt.show()

In [None]:
g = sns.PairGrid(iris)
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)

In [None]:
g = sns.PairGrid(iris, hue="species")
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)
g.add_legend()

In [None]:
g = sns.PairGrid(iris, vars=["sepal_length", "sepal_width"], hue="species")
g.map(sns.scatterplot)

**==========================================================================================================**

## Area Plot

In [None]:
df2.head()

In [None]:
df2.plot(kind = "area", title = "title", figsize=(15, 10))

plt.show()

In [None]:
df2.plot(kind = "area", title = "title", figsize=(15, 10), stacked = False, alpha=0.5)

plt.show()

**==========================================================================================================**

## Hex Plot

### Pandas version

In [None]:
df2.plot.hexbin(x = 'life_expectancy', y = 'pop', gridsize=10, C = "infant_mortality", figsize=(12,8))

plt.title("title")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

**==========================================================================================================**

## Pie Charts

A `pie chart` is a circular graphic that displays numeric proportions by dividing a circle (or pie) into proportional slices. You are most likely already familiar with pie charts, as they are widely used in business and media. We can create pie charts with pandas (which draws them using Matplotlib) by passing `kind='pie'` to `plot`, or directly with Matplotlib's `ax.pie`. Some commonly used parameters:

*   `autopct` -  is a string or function used to label the wedges with their numeric value. The label will be placed inside the wedge. If it is a format string, the label will be `fmt%pct`.
*   `startangle` - rotates the start of the pie chart by angle degrees counterclockwise from the x-axis.
*   `shadow` - Draws a shadow beneath the pie (to give a 3D feel).
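
A minimal sketch of these three parameters with Matplotlib's `ax.pie`, using made-up shares and labels (assumptions for illustration, not data from the notebook):

```python
import matplotlib.pyplot as plt

# Made-up shares; they sum to 100 so the percentages are easy to check by eye
shares = [50, 30, 20]
labels = ["A", "B", "C"]

fig, ax = plt.subplots(figsize=(6, 6))
wedges, texts, autotexts = ax.pie(
    shares,
    labels=labels,
    autopct="%1.1f%%",  # format string applied to each wedge's percentage
    startangle=90,      # first wedge starts at 90 degrees (12 o'clock)
    shadow=True,        # draw a shadow beneath the pie
)
ax.set_title("Illustrative shares")

# The text objects created by autopct hold the formatted percentages
percent_labels = [t.get_text() for t in autotexts]
print(percent_labels)  # ['50.0%', '30.0%', '20.0%']
plt.show()
```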

### Pandas version

In [None]:
df2.iloc[0]

In [None]:
df2.iloc[0].plot.pie(figsize = (12, 8))

plt.show()

In [None]:
df3 = df2.head(3).T
df3

In [None]:
df3.plot.pie(subplots=True, figsize = (30, 30))
plt.show()

In [None]:
df3.plot.pie(subplots=True, figsize = (30, 30), fontsize = 24)
plt.show()

In [None]:
df3.plot.pie(subplots=True, figsize = (30, 30), fontsize = 24, autopct = '%.2f')
plt.show()

### Matplotlib Version

In [None]:
df.groupby(['txn_description'])["age"].mean()

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
ax.pie(x=df.groupby(['txn_description'])["age"].mean(), autopct='%.2f')

plt.show()

### Seaborn Version

In [None]:
df.head()

In [None]:
piechartdf = df.groupby(["vendorid2"], as_index=False)["duration"].mean()
piechartdf

In [None]:
# autopct adds percentage labels; startangle sets where the first wedge begins
piechartdf['duration'].plot(kind='pie',
                            figsize=(15, 6),
                            autopct='%1.1f%%', # add in percentages
                            startangle=90,     # start the first wedge at 90°
                            shadow=False,      # set to True to add a shadow
                            )

plt.title('Title', size=15)
plt.axis('equal') # Sets the pie chart to look like a circle.

plt.show()

In [None]:
colors_list = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue', 'lightgreen', 'pink']
explode_list = [0.1, 0, 0, 0, 0.1, 0.1] # offset for each wedge as a fraction of the radius (one value per wedge)

piechartdf['duration'].plot(kind='pie',
                            figsize=(15, 6),
                            autopct='%1.1f%%', 
                            startangle=90,    
                            shadow=False,       
                            labels=None,         # turn off labels on pie chart
                            pctdistance=1.12,    # the ratio between the center of each pie slice and the start of the text generated by autopct 
                            colors=colors_list,  # add custom colors
                            #explode=explode_list # pull the selected wedges away from the center
                            )

# scale the title up by 12% to match pctdistance
plt.title('Title', y=1.12, size=20) 

plt.axis('equal') 

# add legend, labelled by the grouping column rather than the row index
plt.legend(labels=piechartdf['vendorid2'], loc='upper left') 

plt.show()

**==========================================================================================================**

## Tree Plots

### Matplotlib Version

### Seaborn Version

In [None]:
df.head(1)

In [None]:
plt.figure(figsize=(20,20))
labels=df['drivewheel']
sizes=df['price']
squarify.plot(sizes=sizes,label=labels, color="green")
plt.show()

**==========================================================================================================**

## Plotly Express Graphs

In [None]:
fig = px.bar(data_frame=df, x="", y="", 
             width=600, height=400, title="",
             hover_data=['lifeExp', 'gdpPercap'], color='lifeExp',
             labels={'pop':'population of Canada'})
fig.show()

In [None]:
fig = px.pie(data_frame=df3, names="", values="", 
             width=600, height=400, title="",
             )
fig.show()

In [None]:
fig = px.scatter(data_frame=df, x="", y="", color="continent", title="",
                 size="pop", size_max=10, hover_name="country")
fig.show()

In [None]:
fig = px.scatter(data_frame=df, x="", y="", color="continent", size="pop", size_max=60, title="",
          hover_name="country", facet_col="continent", log_x=True)
fig.show()

In [None]:
fig = px.scatter(data_frame=df, x="", y="", animation_frame="year", animation_group="country",title="",
           size="pop", color="continent", hover_name="country", facet_col="continent",
           log_x=True, size_max=45, range_x=[100,100000], range_y=[25,90],
           labels=dict(pop="Population", gdpPercap="GDP per Capita", lifeExp="Life Expectancy"))
fig.show()

In [None]:
fig = px.box(data_frame=df, x="time", y="total_bill", facet_col="quartilemethod", color="quartilemethod")

fig.show()

In [None]:
fig = px.scatter_matrix(data_frame=df, title="Scatter Matrix", width=2000, height=2000,
                       labels={col:col.replace('_', ' ') for col in df.columns})
fig.show()

In [None]:
fig = px.choropleth(data_frame=df, locations="iso_alpha", color="lifeExp", hover_name="country", 
                    animation_frame="year", title="",
                    color_continuous_scale=px.colors.sequential.Plasma, projection="natural earth")
fig.show()

In [None]:
fig = px.line(data_frame=df, x="", y="", color="continent", line_group="country", hover_name="country",
              title="", line_shape="spline", render_mode="svg", 
              labels={'actual_productivity': 'Actual Productivity'})
fig.show() 

In [None]:
fig = px.area(data_frame=df, x="", y="", color="continent", line_group="country", title="",
              labels={'actual_productivity': 'Actual Productivity'})
fig.show()

In [None]:
fig = px.imshow(df, labels=dict(x="Year", color="GDP%"))  # label the x-axis and the color bar
fig.layout.title = "GDP Annual Growth Rate"  # set the figure title
fig.show()

**==========================================================================================================**

## Geospatial Analysis

In [None]:
mapping = usa_stores[['City','Latitude','Longtitude','Sentiment','Revenue']]
mapping

In [None]:
m = folium.Map(location=[37.090240,-95.712891], zoom_start=5)
m

In [None]:
map_df = pd.DataFrame(mapping.groupby(["City","Latitude","Longtitude"]).mean())
map_df

In [None]:
folium.Marker(location=[33.76,-84.42], popup="Atlanta", tooltip="Sentiment=83.69, Revenue=292.57").add_to(m)
folium.Marker(location=[36.23,-115.27], popup="Las Vegas", tooltip="Sentiment=83.72, Revenue=187.40").add_to(m)
folium.Marker(location=[34.11,-118.41], popup="Los Angeles", tooltip="Sentiment=83.75, Revenue=255.95").add_to(m)
folium.Marker(location=[40.69,-73.92], popup="New York", tooltip="Sentiment=83.71, Revenue=328.38").add_to(m)
folium.Marker(location=[32.83,-117.12], popup="San Diego", tooltip="Sentiment=83.70, Revenue=272.93").add_to(m)

m

In [None]:
m.save("filename.html")

In [None]:
state_geo = "malaysia.geojson"

In [None]:
map2 = folium.Map(location=[4.210484,108.975766], zoom_start=6)

To create a `Choropleth` map, we use the `folium.Choropleth` class with the following main parameters:

1.  `geo_data`, which is the GeoJSON file.
2.  `data`, which is the dataframe containing the data.
3.  `columns`, which lists the two dataframe columns used to create the `Choropleth` map: the first holds the region keys and the second the values to color by.
4.  `key_on`, which is the key or variable in the GeoJSON file that identifies each region. To determine it, open the GeoJSON file in any text editor and note the name of the key that holds the region names (states, in the case of `malaysia.geojson`). If that key is **name**, pass `key_on='feature.properties.name'`. Note that this key is case-sensitive, so you need to pass it exactly as it exists in the GeoJSON file.
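
Because a mismatch between the GeoJSON keys and the dataframe keys leaves regions uncolored, it can help to compare the two before building the map. The sketch below uses only the standard library; the GeoJSON fragment, the data rows, and the helper `lookup` are hypothetical and only illustrate how a dotted `key_on` path such as `feature.properties.name` maps onto the file's structure.

```python
# Hypothetical, hand-written GeoJSON fragment (geometry omitted for brevity)
geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "Johor"}, "geometry": None},
        {"type": "Feature", "properties": {"name": "Kedah"}, "geometry": None},
        {"type": "Feature", "properties": {"name": "Perak"}, "geometry": None},
    ],
}

# Hypothetical data rows: (region key, value to color by)
rows = [("Johor", 10.5), ("Kedah", 7.2), ("perak", 3.1)]  # note the lowercase "perak"

key_on = "feature.properties.name"


def lookup(feature, dotted_path):
    """Follow a dotted path such as 'feature.properties.name' into a feature dict."""
    value = feature
    for part in dotted_path.split(".")[1:]:  # skip the leading 'feature'
        value = value[part]
    return value


geo_keys = {lookup(feature, key_on) for feature in geojson["features"]}
data_keys = {name for name, _ in rows}

# Regions present in the GeoJSON but missing from the data (matching is case-sensitive)
unmatched = sorted(geo_keys - data_keys)
print(unmatched)  # ['Perak']
```

Here the lowercase `"perak"` fails to match `"Perak"`, which is exactly the kind of case mismatch the note on `key_on` warns about.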

In [None]:
folium.Choropleth(geo_data=state_geo, name="choropleth").add_to(map2)

**==========================================================================================================**

# Statistics

## Hypothesis Testing

The goal of hypothesis testing is to answer the question, “Given a sample and an apparent effect, what is the probability of seeing such an effect by chance?” The first step is to quantify the size of the apparent effect by choosing a test statistic (for example, the statistic used by a t-test or ANOVA). The next step is to define a null hypothesis, which is a model of the system based on the assumption that the apparent effect is not real. Then compute the p-value, which is the probability of seeing an effect at least as large as the observed one if the null hypothesis is true; it is not the probability that the null hypothesis is true. Finally, interpret the result: if the p-value is low, the effect is said to be statistically significant, which means it is unlikely to have arisen by chance alone under the null hypothesis.

## Conduct a hypothesis test

Now that you’ve organized your data and simulated random sampling, you’re ready to conduct your hypothesis test. Recall that the two-sample t-test is the standard approach for comparing the means of two independent samples. Let's review the steps for conducting a hypothesis test:

1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value 
4.   Reject or fail to reject the null hypothesis

### Step 1: State the null hypothesis and the alternative hypothesis

The **null hypothesis** is a statement that is assumed to be true unless there is convincing evidence to the contrary. The **alternative hypothesis** is a statement that contradicts the null hypothesis, and is accepted as true only if there is convincing evidence for it. 

In a two-sample t-test, the null hypothesis states that there is no difference between the means of your two groups. The alternative hypothesis states the contrary claim: there is a difference between the means of your two groups. 

We use $H_0$ to denote the null hypothesis, and $H_A$ to denote the alternative hypothesis.

*   $H_0$: There is no difference in the mean district literacy rates between STATE21 and STATE28
*   $H_A$: There is a difference in the mean district literacy rates between STATE21 and STATE28



### Step 2: Choose a significance level

The **significance level** is the threshold at which you will consider a result statistically significant. This is the probability of rejecting the null hypothesis when it is true. The education department asks you to use their standard level of 5%, or 0.05. 
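
To see why the significance level is the probability of rejecting a true null hypothesis, here is a small simulation sketch. It assumes made-up normally distributed data (mean 70, standard deviation 10) and draws both groups from the same distribution, so every rejection is a false positive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 2000

# Both groups come from the same distribution, so the null hypothesis is true by construction.
group_a = rng.normal(loc=70.0, scale=10.0, size=(n_experiments, 20))
group_b = rng.normal(loc=70.0, scale=10.0, size=(n_experiments, 20))

# One two-sample t-test per row (axis=1), without assuming equal variances.
result = stats.ttest_ind(group_a, group_b, axis=1, equal_var=False)

# Share of experiments in which the true null hypothesis was rejected.
false_positive_rate = np.mean(result.pvalue < alpha)
print(f"Share of true nulls rejected: {false_positive_rate:.3f}")
```

The printed share should land close to `alpha`, which is what a 5% significance level means in practice.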

### Step 3: Find the p-value

**P-value** refers to the probability of observing results as or more extreme than those observed when the null hypothesis is true.

Based on your sample data, the difference between the mean district literacy rates of STATE21 and STATE28 is 6.2 percentage points. Your null hypothesis claims that this difference is due to chance. Your p-value is the probability of observing an absolute difference in sample means that is 6.2 or greater *if* the null hypothesis is true. If this outcome is very unlikely (in particular, if your p-value is *less than* your significance level of 5%), then you will reject the null hypothesis.
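
One way to make this definition concrete is a permutation test, which is a different technique from the t-test used in this section. The sketch below assumes two made-up samples of 20 literacy rates each. It shuffles the group labels many times and counts how often the shuffled difference in means is at least as large as the observed one.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical literacy rates (%) for two samples of 20 districts each.
sample_a = rng.normal(loc=70.0, scale=8.0, size=20)
sample_b = rng.normal(loc=64.0, scale=8.0, size=20)
observed_diff = abs(sample_a.mean() - sample_b.mean())

# Under the null hypothesis the group labels are interchangeable,
# so we pool the values and reassign them at random.
pooled = np.concatenate([sample_a, sample_b])
n_permutations = 10_000
count_extreme = 0
for _ in range(n_permutations):
    shuffled = rng.permutation(pooled)
    diff = abs(shuffled[:20].mean() - shuffled[20:].mean())
    if diff >= observed_diff:
        count_extreme += 1

# Estimated probability of a difference this large or larger by chance alone.
p_value = count_extreme / n_permutations
print("Observed absolute difference:", observed_diff)
print("Permutation p-value:", p_value)
```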

#### `scipy.stats.ttest_ind()`

For a two-sample $t$-test, you can use `scipy.stats.ttest_ind()` to compute your p-value. This function includes the following arguments:

*   `a`: Observations from the first sample. 
*   `b`: Observations from the second sample.
*   `equal_var`: A boolean, or true/false statement, which indicates whether the population variance of the two samples is assumed to be equal. In our example, you don’t have access to data for the entire population, so you don’t want to assume anything about the variance. To avoid making a wrong assumption, set this argument to `False`. 

Reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html.


Now you’re ready to write your code and enter the relevant arguments: 

*   `a`: Your first sample refers to the district literacy rate data for STATE21, which is stored in the `OVERALL_LI` column of your variable `sampled_state21`.
*   `b`: Your second sample refers to the district literacy rate data for STATE28, which is stored in the `OVERALL_LI` column of your variable `sampled_state28`.
*   `equal_var`: Set to `False` because you don’t want to assume that the two samples have the same variance.
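
Putting the three arguments together might look like the sketch below. The two data frames here are hypothetical stand-ins filled with random numbers, not the real district data, so the printed values will not match the results quoted in this section.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical stand-ins for the sampled district data; only OVERALL_LI is used.
sampled_state21 = pd.DataFrame({"OVERALL_LI": rng.normal(70.0, 8.0, size=20)})
sampled_state28 = pd.DataFrame({"OVERALL_LI": rng.normal(64.0, 8.0, size=20)})

result = stats.ttest_ind(
    a=sampled_state21["OVERALL_LI"],
    b=sampled_state28["OVERALL_LI"],
    equal_var=False,  # do not assume equal variances (Welch's t-test)
)
print("t-statistic:", result.statistic)
print("p-value:", result.pvalue)
```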

### Step 4: Reject or fail to reject the null hypothesis

To draw a conclusion, compare your p-value with the significance level.

*   If the p-value is less than the significance level, you conclude there is a statistically significant difference in the mean district literacy rates between STATE21 and STATE28. In other words, you reject the null hypothesis $H_0$.
*   If the p-value is greater than the significance level, you conclude there is *not* a statistically significant difference in the mean district literacy rates between STATE21 and STATE28. In other words, you fail to reject the null hypothesis $H_0$.

Your p-value of 0.0064, or 0.64%, is less than the significance level of 0.05, or 5%. So, you *reject* the null hypothesis, and conclude that there is a statistically significant difference between the mean district literacy rates of the two states STATE21 and STATE28. 

### Simulate random sampling

Now that you’ve organized your data, use the `sample()` function to take a random sample of 20 districts from each state. First, name a new variable: `sampled_state21`. Then, enter the arguments of the `sample()` function. 

*   `n`: Your sample size is `20`. 
*   `replace`: Choose `True` because you are sampling with replacement.
*   `random_state`: Choose an arbitrary number for the random seed, such as `13490`.

In [None]:
sampled_state21 = state21.sample(n=20, replace = True, random_state=13490)

In [None]:
sampled_state28 = state28.sample(n=20, replace = True, random_state=39103)
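
The effect of `replace` and `random_state` can be checked on a small example. The frame below is a hypothetical stand-in with 30 made-up districts. Calling `sample()` twice with the same seed returns the same rows, and sampling with replacement allows a district to appear more than once.

```python
import pandas as pd

# Hypothetical frame of 30 districts with made-up literacy rates.
state21_demo = pd.DataFrame({
    "DISTRICT": range(30),
    "OVERALL_LI": [60 + i for i in range(30)],
})

first_draw = state21_demo.sample(n=20, replace=True, random_state=13490)
second_draw = state21_demo.sample(n=20, replace=True, random_state=13490)

# Same seed, same rows.
print("Draws identical:", first_draw.equals(second_draw))

# With replacement, the same district can be drawn more than once.
print("Any repeated district:", first_draw["DISTRICT"].duplicated().any())
```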

### T-Test

We will be using the t-test for independent samples. For the independent t-test, the following assumptions must be met.

-   One independent, categorical variable with two levels or groups
-   One dependent continuous variable
-   Independence of the observations. Each subject should belong to only one group. There is no relationship between the observations in each group.
-   The dependent variable must follow a normal distribution
-   Assumption of homogeneity of variance
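
The normality assumption can be checked before running the test. The sketch below uses the Shapiro-Wilk test, which this section does not otherwise cover, on two made-up groups of scores. A p-value above 0.05 means there is no evidence against normality for that group.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical evaluation scores for two groups.
samples = {
    "group 1": rng.normal(4.0, 0.5, size=50),
    "group 2": rng.normal(4.2, 0.5, size=50),
}

normality_p_values = {}
for label, values in samples.items():
    # Null hypothesis of Shapiro-Wilk: the values come from a normal distribution.
    statistic, p_value = stats.shapiro(values)
    normality_p_values[label] = p_value
    print(f"{label}: W = {statistic:.3f}, p = {p_value:.3f}")
```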


State the hypothesis

-   $H_0: \mu_1 = \mu_2$ ("there is no difference in evaluation scores between males and females")
-   $H_1: \mu_1 \neq \mu_2$ ("there is a difference in evaluation scores between males and females")


## Levene's Test

We can use Levene's test in Python to check whether the two groups have equal variances.

```
scipy.stats.levene(ratings_df[ratings_df['gender'] == 'female']['eval'],
                   ratings_df[ratings_df['gender'] == 'male']['eval'], center='mean')
```
**Since the p-value is greater than 0.05, we can assume equality of variance.**

**LeveneResult(statistic=0.19032922435292574, pvalue=0.6628469836244741)**
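
The result of Levene's test can feed directly into the `equal_var` argument of the t-test. The sketch below assumes two made-up groups of evaluation scores and a 0.05 threshold. It is one possible workflow, not the only one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical evaluation scores for two groups.
female_eval = rng.normal(3.9, 0.55, size=60)
male_eval = rng.normal(4.1, 0.55, size=60)

# Levene's test: the null hypothesis is that the group variances are equal.
levene_stat, levene_p = stats.levene(female_eval, male_eval, center="mean")

# Assume equal variances only when Levene's test does not reject its null hypothesis.
equal_var = bool(levene_p > 0.05)

t_stat, t_p = stats.ttest_ind(female_eval, male_eval, equal_var=equal_var)
print("Levene p-value:", levene_p, "-> equal_var =", equal_var)
print("t-statistic:", t_stat, "p-value:", t_p)
```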

## T-Test

### One Sample T-Test

In [None]:
t, p = scipy.stats.ttest_1samp(a=sampled_state21.dose, popmean=1.166667)

In [None]:
print("t-statistic is: ", t)
print("p-value is: ", p)
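
The one-sample statistic is $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$, with $n - 1$ degrees of freedom. The sketch below computes it by hand on a made-up sample and compares the result with `ttest_1samp()`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical sample of 20 measurements.
sample = rng.normal(loc=1.3, scale=0.6, size=20)
popmean = 1.166667

# t = (sample mean - hypothesized mean) / standard error of the mean
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))
t_manual = (sample.mean() - popmean) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=len(sample) - 1)

t_scipy, p_scipy = stats.ttest_1samp(a=sample, popmean=popmean)
print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```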

### Two Samples T-Test

```
scipy.stats.ttest_ind(ratings_df[ratings_df['gender'] == 'female']['eval'],
                   ratings_df[ratings_df['gender'] == 'male']['eval'], equal_var = True)

```

**Ttest_indResult(statistic=-3.249937943510772, pvalue=0.0012387609449522217)**

**Conclusion:** Since the p-value is less than the alpha value of 0.05, we reject the null hypothesis: there is sufficient evidence of a statistically significant difference in teaching evaluations based on gender.

In [None]:
# equal_var=False runs Welch's t-test; use True only if the variances can be assumed equal
t, p = scipy.stats.ttest_ind(a=sampled_state21["len"], b=sampled_state28["len"], equal_var=False)

In [None]:
print("t-statistic is: ", t)
print("p-value is: ", p)

### ResearchPy

```
rp.ttest(group1= df2['Median'][df['Major_category'] == 'Computers & Mathematics'], group1_name= "CM",
         group2= df2['Median'][df['Major_category'] == 'Education'], group2_name= "EDU",
         equal_variances=True, paired=False)
```

### ANOVA

<p>The Analysis of Variance  (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:</p>

<p><b>F-test score</b>: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.</p>

<p><b>P-value</b>:  P-value tells how statistically significant our calculated score value is.</p>

<p>If our price variable is strongly correlated with the variable we are analyzing, we expect ANOVA to return a sizeable F-test score and a small p-value.</p>
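
The F-test score is the ratio of the variation between group means to the variation within groups. The sketch below computes it by hand for three made-up groups and compares the result with `scipy.stats.f_oneway()`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Three hypothetical groups of 25 observations with different means.
groups = [rng.normal(m, 1.0, size=25) for m in (0.0, 0.3, 0.8)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Variation of the group means around the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print("manual:", f_manual, p_manual)
print("scipy: ", f_scipy, p_scipy)
```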



* **One-way ANOVA:** Compares the means of one continuous dependent variable based on three or more groups of one categorical variable.
* **Two-way ANOVA:** Compares the means of one continuous dependent variable based on three or more groups of two categorical variables.


First, we group the data into categories, because one-way ANOVA needs a categorical grouping variable rather than a continuous one. Using the example from the video, we will create a new column for the group assignment. Our categories will be teachers who are:

-   40 years and younger
-   between 40 and 57 years
-   57 years and older

```
ratings_df.loc[(ratings_df['age'] <= 40), 'age_group'] = '40 years and younger'
ratings_df.loc[(ratings_df['age'] > 40)&(ratings_df['age'] < 57), 'age_group'] = 'between 40 and 57 years'
ratings_df.loc[(ratings_df['age'] >= 57), 'age_group'] = '57 years and older'

```

In order to run ANOVA, we need to create a regression model. To do this, we'll import the `statsmodels.api` package and the `ols()` function. Next, we'll create a simple linear regression model where the X variable is `color`, which we will code as categorical using `C()`. Then, we'll fit the model to the data, and generate model summary statistics.

State the hypothesis

-   $H_0: \mu_1 = \mu_2 = \mu_3$ (the three population means are equal)
-   $H_1:$ At least one of the means differs


Test for equality of variance
```
scipy.stats.levene(ratings_df[ratings_df['age_group'] == '40 years and younger']['beauty'],
                   ratings_df[ratings_df['age_group'] == 'between 40 and 57 years']['beauty'], 
                   ratings_df[ratings_df['age_group'] == '57 years and older']['beauty'], 
                   center='mean')
```
**Since the p-value is less than 0.05, the variances are not equal. For the purposes of this exercise, we will move along.**

LeveneResult(statistic=8.60005668392584, pvalue=0.000215366180993476)

First, separate the three samples (one for each age group) into a variable each.

```
forty_lower = ratings_df[ratings_df['age_group'] == '40 years and younger']['beauty']
forty_fiftyseven = ratings_df[ratings_df['age_group'] == 'between 40 and 57 years']['beauty']
fiftyseven_older = ratings_df[ratings_df['age_group'] == '57 years and older']['beauty']
```
Now, run a one-way ANOVA.

```
f_statistic, p_value = scipy.stats.f_oneway(forty_lower, forty_fiftyseven, fiftyseven_older)
print("F_Statistic: {0}, P-Value: {1}".format(f_statistic,p_value))
```

F_Statistic: 17.597558611010122, P-Value: 4.3225489816137975e-08

**Conclusion:** Since the p-value is less than 0.05, we will reject the null hypothesis as there is significant evidence that at least one of the means differs.

### One Way ANOVA

In [None]:
mod = ols('len~supp', data=df).fit()

We use the `anova_lm()` function from the `statsmodels.stats` package. As noted previously, the function requires a fitted regression model, and for us to specify the type of ANOVA: 1, 2, or 3. You can review the [`statsmodels` documentation](https://www.statsmodels.org/dev/generated/statsmodels.stats.anova.anova_lm.html) to learn more. Since the p-value (column `PR(>F)`) is very small, we can reject the null hypothesis that the mean of the price is the same for all diamond color grades. 

In [None]:
aov_table = sm.stats.anova_lm(mod,typ=2)

In [None]:
aov_table

In [None]:
f_statistic, p_value = scipy.stats.f_oneway(forty_lower, forty_fiftyseven, fiftyseven_older)
print("F_Statistic: {0}, P-Value: {1}".format(f_statistic,p_value))

### Two-way ANOVA

We will prepare a second dataset so we can perform a two-way ANOVA, which requires two categorical variables.

In [None]:
mod1 = ols('len~supp+dose', data=df).fit()

In [None]:
mod1.summary()

Based on the model summary table, many of the color grades' and cuts' associated beta coefficients have a p-value of less than 0.05 (check the `P>|t|` column). Additionally, some of the interactions also seem statistically significant. We'll use a two-way ANOVA to examine further the relationships between price and the two categories of color grade and cut.

First, we have to state our three pairs of null and alternative hypotheses:

#### **Null Hypothesis (Color)**

$$H_0: price_D=price_E=price_F=price_H=price_I$$

There is no difference in the price of diamonds based on color.

#### **Alternative Hypothesis (Color)**

$$H_1: \text{Not } price_D=price_E=price_F=price_H=price_I$$

There is a difference in the price of diamonds based on color.

#### **Null Hypothesis (Cut)**

$$H_0: price_{Ideal}=price_{Premium}=price_{Very \space Good}$$

There is no difference in the price of diamonds based on cut.

#### **Alternative Hypothesis (Cut)**

$$H_1: \text{Not } price_{Ideal}=price_{Premium}=price_{Very \space Good}$$

There is a difference in the price of diamonds based on cut.

#### **Null Hypothesis (Interaction)**

$$H_0: \text{The effect of color on diamond price is independent of the cut, and vice versa.}$$

#### **Alternative Hypothesis (Interaction)**

$$H_1: \text{There is an interaction effect between color and cut on diamond price.}$$

In [None]:
aov1 = sm.stats.anova_lm(mod1,typ=2)

In [None]:
aov1

Since all of the p-values (column PR(>F)) are very small, we can reject all three null hypotheses.

## Post hoc test

There are many post hoc tests that can be run. One of the most common ANOVA post hoc tests is **Tukey's HSD (honestly significant difference) test**. We can import the `pairwise_tukeyhsd()` function from the `statsmodels.stats.multicomp` module to run the test.

Then we can run the test. The `endog` argument specifies which variable is being compared across groups, which is `log_price` in this case. The `groups` argument indicates which variable holds the groups we're comparing, which is `color`. `alpha` sets the significance level, which we'll set to `0.05`, corresponding to the typical 95% confidence level.

In [None]:
# Run Tukey's HSD post hoc test for one-way ANOVA
tukey_oneway = pairwise_tukeyhsd(endog = diamonds["log_price"], groups = diamonds["color"], alpha = 0.05)

In [None]:
# Get results (pairwise comparisons)
tukey_oneway.summary()
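
A self-contained sketch of the same call, with its import, is shown below. It assumes three made-up groups: `A` and `B` share a mean, while `C` is shifted well away from both, so the comparisons involving `C` should be rejected.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(2)

# Hypothetical data: groups A and B have the same mean, group C is far from both.
values = np.concatenate([
    rng.normal(0.0, 1.0, 30),
    rng.normal(0.0, 1.0, 30),
    rng.normal(5.0, 1.0, 30),
])
labels = np.repeat(["A", "B", "C"], 30)

tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey.summary())

# One boolean per pair, in the order A-B, A-C, B-C.
print(tukey.reject)
```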

Each row represents a pairwise comparison between the prices of two diamond color grades. The `reject` column tells us which null hypotheses we can reject. Based on the values in that column, we can reject each null hypothesis except the one comparing D and E color diamonds: we cannot reject the null hypothesis that D and E color diamonds have the same price.

#### **Test 1: D vs. E**
$$H_0: price_D=price_E$$

The price of D and E color grade diamonds are the same.

$$H_1: price_D \neq price_E$$

The price of D and E color grade diamonds are not the same.

**Result:** We *cannot* reject the null hypothesis that the price of D and E color grade diamonds are the same.

#### **Test 2: D vs. F**
$$H_0: price_D=price_F$$

The price of D and F color grade diamonds are the same.

$$H_1: price_D \neq price_F$$

The price of D and F color grade diamonds are not the same.

**Result:** We *can* reject the null hypothesis that the price of D and F color grade diamonds are the same.

#### **Test 3: D vs. H**
$$H_0: price_D=price_H$$

The price of D and H color grade diamonds are the same.

$$H_1: price_D \neq price_H$$

The price of D and H color grade diamonds are not the same.

**Result:** We *can* reject the null hypothesis that the price of D and H color grade diamonds are the same.

#### **Test 4: D vs. I**
$$H_0: price_D=price_I$$

The price of D and I color grade diamonds are the same.

$$H_1: price_D \neq price_I$$

The price of D and I color grade diamonds are not the same.

**Result:** We *can* reject the null hypothesis that the price of D and I color grade diamonds are the same.

#### **Test 5: E vs. F**
$$H_0: price_E=price_F$$

The price of E and F color grade diamonds are the same.

$$H_1: price_E \neq price_F$$

The price of E and F color grade diamonds are not the same.

**Result:** We *can* reject the null hypothesis that the price of E and F color grade diamonds are the same.

#### **Test 6: E vs. H**
$$H_0: price_E=price_H$$

The price of E and H color grade diamonds are the same.

$$H_1: price_E \neq price_H$$

The price of E and H color grade diamonds are not the same.

**Result:** We *can* reject the null hypothesis that the price of E and H color grade diamonds are the same.

#### **Test 7: E vs. I**
$$H_0: price_E=price_I$$

The price of E and I color grade diamonds are the same.

$$H_1: price_E \neq price_I$$

The price of E and I color grade diamonds are not the same.

**Result:** We *can* reject the null hypothesis that the price of E and I color grade diamonds are the same.

#### **Test 8: F vs. H**
$$H_0: price_F=price_H$$

The price of F and H color grade diamonds are the same.

$$H_1: price_F \neq price_H$$

The price of F and H color grade diamonds are not the same.

**Result:** We *can* reject the null hypothesis that the price of F and H color grade diamonds are the same.

#### **Test 9: F vs. I**
$$H_0: price_F=price_I$$

The price of F and I color grade diamonds are the same.

$$H_1: price_F \neq price_I$$

The price of F and I color grade diamonds are not the same.

**Result:** We *can* reject the null hypothesis that the price of F and I color grade diamonds are the same.

#### **Test 10: H vs. I**
$$H_0: price_H=price_I$$

The price of H and I color grade diamonds are the same.

$$H_1: price_H \neq price_I$$

The price of H and I color grade diamonds are not the same.

**Result:** We *can* reject the null hypothesis that the price of H and I color grade diamonds are the same.

### Chi-square

State the hypothesis:

-   $H_0:$ The proportion of teachers who are tenured is independent of gender
-   $H_1:$ The proportion of teachers who are tenured is associated with gender

In [None]:
#Create a Cross-tab table

cont_table  = pd.crosstab(ratings_df['tenure'], ratings_df['gender'])
cont_table

In [None]:
scipy.stats.chi2_contingency(cont_table, correction = True)

In [None]:
chi_square = scipy.stats.chi2_contingency(cont_table, correction = True)

In [None]:
print("Chi-square statistic is", chi_square[0])

In [None]:
print("P-value is", chi_square[1])

In [None]:
print("Degrees of freedom is", chi_square[2])
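
`chi2_contingency()` also returns a fourth value: the counts expected under independence. The sketch below uses a made-up 2x2 table of tenure by gender, not the real `ratings_df` counts, and unpacks all four results.

```python
import pandas as pd
from scipy import stats

# Hypothetical contingency table: rows are tenure status, columns are gender.
table = pd.DataFrame(
    {"female": [50, 145], "male": [52, 216]},
    index=["no", "yes"],
)

chi2, p_value, dof, expected = stats.chi2_contingency(table, correction=True)

print("Chi-square statistic:", chi2)
print("P-value:", p_value)
print("Degrees of freedom:", dof)
# Counts we would expect in each cell if tenure were independent of gender.
print("Expected counts:")
print(expected)
```

For a table with r rows and c columns, the degrees of freedom are (r - 1)(c - 1), which is 1 for a 2x2 table.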

### Correlation

State the hypothesis:

-   $H_0:$ Teaching evaluation score is not correlated with beauty score
-   $H_1:$ Teaching evaluation score is correlated with beauty score


In [None]:
pearson_correlation = scipy.stats.pearsonr(ratings_df['beauty'], ratings_df['eval'])

In [None]:
print("Pearson's correlation coefficient is", pearson_correlation[0])

In [None]:
print("P-value is", pearson_correlation[1])
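
Pearson's coefficient is the sum of the products of the deviations from each mean, divided by the square root of the product of the summed squared deviations. The sketch below computes it by hand on made-up `beauty` and evaluation scores and compares the result with `pearsonr()`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

# Hypothetical scores with a weak positive linear relationship.
beauty = rng.normal(0.0, 1.0, 100)
evaluation = 4.0 + 0.2 * beauty + rng.normal(0.0, 0.5, 100)

r_scipy, p_value = stats.pearsonr(beauty, evaluation)

# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), using deviations from the means.
x_dev = beauty - beauty.mean()
y_dev = evaluation - evaluation.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

print("scipy: ", r_scipy)
print("manual:", r_manual)
print("p-value:", p_value)
```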

## Correlation and Causation

<p><b>Correlation</b>: a measure of the extent of interdependence between variables.</p>

<p><b>Causation</b>: the relationship between cause and effect between two variables.</p>

<p>It is important to know the difference between these two. Correlation does not imply causation. Determining correlation is much simpler than determining causation, as causation may require independent experimentation.</p>

<p><b>Pearson Correlation</b></p>
<p>The Pearson Correlation measures the linear dependence between two variables X and Y.</p>
<p>The resulting coefficient is a value between -1 and 1 inclusive, where:</p>
<ul>
    <li><b>1</b>: Perfect positive linear correlation.</li>
    <li><b>0</b>: No linear correlation. The two variables may still be related in a nonlinear way.</li>
    <li><b>-1</b>: Perfect negative linear correlation.</li>
</ul>
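
A coefficient near 0 only rules out a linear relationship. In the sketch below, `y` is fully determined by `x`, yet the Pearson coefficient is essentially zero because the relationship is a symmetric curve rather than a line.

```python
import numpy as np
from scipy import stats

# y depends on x exactly, but not linearly.
x = np.linspace(-1.0, 1.0, 101)
y = x ** 2

r, _ = stats.pearsonr(x, y)
print("Pearson correlation between x and x**2:", r)
```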

<b>P-value</b>

<p>What is this P-value? The P-value is the probability of observing a correlation at least as strong as the one in our sample if the two variables were in fact uncorrelated. Normally, we choose a significance level of 0.05, which means that we accept a 5% chance of declaring a correlation significant when none exists.</p>

By convention, when the p-value is:

<ul>
    <li>$<$ 0.001: there is strong evidence that the correlation is significant.</li>
    <li>$<$ 0.05: there is moderate evidence that the correlation is significant.</li>
    <li>$<$ 0.1: there is weak evidence that the correlation is significant.</li>
    <li>$>$ 0.1: there is no evidence that the correlation is significant.</li>
</ul>


In [None]:
df.corr()

In [None]:
df.corr()["target"].sort_values()

In [None]:
plt.figure(figsize=(16,9))
sns.heatmap(df.corr(),cmap="coolwarm",annot=True,fmt='.2f',linewidths=2)
plt.title("Correlation Heatmap", fontsize=20)
plt.show()

In [None]:
# Plot a correlation heatmap
plt.figure(figsize=(16, 9))
heatmap = sns.heatmap(df.corr(), vmin=-1, vmax=1, annot=True, cmap=sns.color_palette("vlag", as_cmap=True))
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':14}, pad=12);

**==========================================================================================================**

#### Python code done by Dennis Lam