# Data Visualization using Python

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from scipy.stats import linregress

AttributeError: module 'matplotlib.cm' has no attribute 'register_cmap'

In [None]:

print("Seaborn version:", sns.__version__)

|  X1  |  X2  |  X3  |   Y   |
|------|------|------|-------|
|  10  |  5   |  20  |  500  |
|   8  |  2   |  15  |  400  |
|  12  |  6   |  18  |  550  |
|   6  |  3   |  10  |  350  |
|   9  |  4   |  22  |  480  |

|                | <span style="color:yellow">Feature Selection</span>       | <span style="color:yellow">Feature Correlation</span>                                                 |
|----------------|------------------------------------------------------------------|------------------------------------------------------------------------|
| Purpose        | Select a subset of relevant features from a larger set          | Measure the statistical relationship between features                     |
| Goal           | Improve model performance, reduce complexity, enhance interpretability            |Understand dependencies, identify collinear features  |
| Focus          | Selection of informative and discriminative features                     | Measurement of association between variables                   |
| Outcome        | Reduced dimensionality, improved model performance    | Insights into relationships, identification of collinearity |
| Techniques     | Filter methods, wrapper methods, embedded methods |  Correlation coefficients, such as Pearson's correlation coefficient |
| Considerations | Relevance to the target variable, redundancy, importance             | Strength and direction of association, collinearity                |
| Impact         | Improved model accuracy, efficiency, interpretability         | Insights into feature relationships, data preprocessing                  |

These two concepts are not mutually exclusive. Feature correlation can be considered as a factor in the feature selection process, but feature selection involves broader considerations beyond just correlation. The choice and combination of feature selection techniques and correlation analysis depend on the specific problem, dataset, and goals of the analysis.
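
As a rough illustration of how correlation can feed into feature selection, the sketch below uses the values from the small X1/X2/X3/Y table above. It ranks the features by their Pearson correlation with Y and lists feature pairs that exceed a cutoff. The 0.8 cutoff and the variable names are assumptions made for this example, not a recommended rule.

```python
import pandas as pd

# Values copied from the small X1/X2/X3/Y table above
df = pd.DataFrame({
    "X1": [10, 8, 12, 6, 9],
    "X2": [5, 2, 6, 3, 4],
    "X3": [20, 15, 18, 10, 22],
    "Y": [500, 400, 550, 350, 480],
})

corr = df.corr()  # pandas computes Pearson correlation by default

# Feature-target correlation, strongest first
target_corr = corr["Y"].drop("Y").sort_values(ascending=False)
print(target_corr)

# Feature-feature pairs whose absolute correlation exceeds the cutoff
threshold = 0.8  # hypothetical cutoff chosen for illustration
features = ["X1", "X2", "X3"]
collinear_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(features)
    for b in features[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(collinear_pairs)
```

With only five rows these numbers are unstable, so the output should be read as a demonstration of the mechanics rather than a conclusion about the features.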

|                   | <span style="color:yellow">Correlation between Independent Variables</span>               | <span style="color:yellow">Correlation between Independent and Target Variables</span>  |
|--------------------------|-------------------------------------------------------|-----------------------------------------------------|
| Definition               | Measures the relationship between two or more independent variables.  | Measures the relationship between an independent variable and a target variable. |
| Purpose                  | Identify patterns and dependencies among independent variables. | Assess the predictive power or influence of a feature on the target variable. |
| Type of Relationship     | Linear relationship between independent variables.    | Linear relationship between independent and target variables. |
| Measurement Range        | -1 (perfect negative correlation) to +1 (perfect positive correlation). | -1 (perfect negative correlation) to +1 (perfect positive correlation). |
| Interpretation           | - Positive correlation: Variables move together in the same direction. <br> - Negative correlation: Variables move in opposite directions. <br> - No correlation: Variables have no linear relationship. | - Positive correlation: Increase in independent variable associated with increase in target variable. <br> - Negative correlation: Increase in independent variable associated with decrease in target variable. <br> - No correlation: Independent variable has no linear relationship with target variable. |
| Application              | Feature correlation, identifying redundant or collinear features. | Feature selection, understanding predictors' impact on the target, model building. |

## Scatter Plot

1. A scatter plot is a visual representation of the relationship between two variables in a dataset, where each data point is represented as a point on the plot.
    -   The x-axis of the scatter plot represents one variable, while the y-axis represents the other variable.
    -   Each point on the scatter plot corresponds to a specific combination of values for the two variables, allowing for a visual examination of their joint behavior.
2. Scatter plots can reveal different types of relationships between variables:
    -   <span style="color:yellow">Positive correlation</span> is indicated when the points on the scatter plot tend to form an upward trend, suggesting that as one variable increases, the other variable also tends to increase.
    -   <span style="color:yellow">Negative correlation</span> is observed when the points on the scatter plot tend to form a downward trend, indicating that as one variable increases, the other variable tends to decrease.
    -   <span style="color:yellow">A lack of correlation</span> is depicted by points scattered randomly across the plot, indicating no apparent linear relationship between the variables.
3. Scatter plots are effective for exploring correlations, patterns, and trends between two numerical variables, and for identifying potential outliers.

#### Positive correlation
Some examples:
-   The more hours a student studies, the higher their exam score tends to be.
-   As the temperature increases, the sales of ice cream also tend to increase.
-   The more years of experience a person has, the higher their salary tends to be.

It can be useful for:

-   Feature selection: identifying positively correlated features helps reveal redundancy, because strongly correlated features tend to carry similar information. If two or more features are highly positively correlated, keeping only one of them may be sufficient for model training while still capturing the underlying relationship.

In [None]:
# Sample data
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [30, 40, 50, 60, 70, 80, 90, 95, 98, 100]

# Creating scatter plot
plt.scatter(x, y, color='blue')

# Customizing the plot
plt.title("Scatter Plot - Positive Correlation")
plt.xlabel("Hours Studied")
plt.ylabel("Exam Score")

# Displaying the x and y values on the ticks
plt.xticks(x)
plt.yticks(y)

# Adding a grid
plt.grid(True, linewidth=0.2)
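
The upward trend in this scatter plot can also be put into numbers. The sketch below fits a straight line to the same hours/score sample with `scipy.stats.linregress` (imported at the top of the notebook) and reports the slope and Pearson's r. It assumes a linear model is a reasonable summary of the data.

```python
from scipy.stats import linregress

# Same sample data as the scatter plot above
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [30, 40, 50, 60, 70, 80, 90, 95, 98, 100]

result = linregress(hours, scores)

# A positive slope and an r close to +1 indicate a strong positive linear trend
print(f"slope = {result.slope:.2f}")
print(f"intercept = {result.intercept:.2f}")
print(f"r = {result.rvalue:.3f}")
```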

#### Negative correlation
Some examples:
-   As the amount of exercise increases, body weight tends to decrease.
-   The more time spent commuting to work, the lower job satisfaction tends to be.
-   As the temperature drops, heating costs tend to increase.

It can be useful for:

- Feature selection: identifying negatively correlated features can help in avoiding redundant or collinear features during the feature selection process. When two features are highly negatively correlated, they carry largely the same information with opposite sign, so including both in the model may add little. In such cases, keeping only one of the negatively correlated features can improve model interpretability and reduce the risk of overfitting.
- Anomaly detection: negative correlations can help detect unusual patterns or outliers in the data, where one variable behaves differently from the expected relationship.

In [None]:
# Sample data
temperature = [15, 12, 10, 8, 5, 2, 0, -3, -5, -8]
heating_costs = [8000, 9000, 10000, 11000, 12000, 13000, 14000, 15000, 16000, 17000]

# Creating scatter plot
plt.scatter(temperature, heating_costs, color='blue')

# Customizing the plot
plt.title("Scatter Plot - Negative Correlation")
plt.xlabel("Temperature (°C)")
plt.ylabel("Heating Costs (JPY)")

# Displaying the x and y values on the ticks
plt.xticks(temperature)
plt.yticks(heating_costs)

# Adding a grid
plt.grid(True, linewidth=0.2)
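
One possible way to use this relationship for anomaly detection is to fit a line and flag points whose residual is unusually large. The sketch below reuses the temperature and heating-cost sample, but overwrites one reading with a made-up anomalous value so that there is something to detect. The injected value and the cutoff of 2 standard deviations are assumptions for illustration only.

```python
import numpy as np

temperature = np.array([15, 12, 10, 8, 5, 2, 0, -3, -5, -8])
heating_costs = np.array(
    [8000, 9000, 10000, 11000, 12000, 13000, 14000, 15000, 16000, 17000],
    dtype=float,
)

# Hypothetical anomaly: this value is NOT in the original sample
heating_costs[4] = 20000

# Fit the expected (linear) relationship
slope, intercept = np.polyfit(temperature, heating_costs, deg=1)

# Residual = observed cost minus the cost predicted by the fitted line
residuals = heating_costs - (slope * temperature + intercept)
z_scores = residuals / residuals.std()

# Flag points that deviate from the fitted line by more than 2 standard deviations
outliers = np.flatnonzero(np.abs(z_scores) > 2)
print("Outlier indices:", outliers)
```

Because the outlier itself pulls on the fitted line, this simple approach works best when anomalies are few. Robust regression would be a more careful alternative.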

#### Lack of correlation
Some examples:
-   There is no clear relationship between shoe size and intelligence.
-   There is no correlation between the number of hours slept and shoe size.
-   The amount of rainfall does not affect the price of stocks.

It can be useful for:

-   Feature Selection: identifying variables with no correlation to the target variable can help in excluding irrelevant features that do not contribute to the predictive power of the model.

In [None]:
# Sample data
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [100, 95, 98, 105, 102, 98, 100, 105, 101, 99]

# Creating scatter plot
plt.scatter(x, y, color='blue')

# Customizing the plot
plt.title("Scatter Plot - Lack of Correlation")
plt.xlabel("Amount of Rainfall")
plt.ylabel("Stock Price (USD)")

# Displaying the x and y values on the ticks
plt.xticks(x)
plt.yticks(y)

# Adding a grid
plt.grid(True, linewidth=0.2)
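
A visual impression of "no correlation" can be checked numerically. The sketch below runs `scipy.stats.pearsonr` on the same rainfall/stock sample. It returns the coefficient together with a p-value for the null hypothesis of no linear correlation. With only ten points the test has little power, so treat it as a sanity check rather than proof.

```python
from scipy.stats import pearsonr

# Same sample data as the scatter plot above
rainfall = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
stock_price = [100, 95, 98, 105, 102, 98, 100, 105, 101, 99]

r, p_value = pearsonr(rainfall, stock_price)

# A small |r| together with a large p-value gives no evidence of a linear relationship
print(f"r = {r:.3f}, p-value = {p_value:.3f}")
```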

### Improve the Scatter plot

In [None]:
# Generate random data from a normal distribution
np.random.seed(0)
x = np.random.normal(loc=0, scale=1, size=100)
y = np.random.normal(loc=0, scale=1, size=100)

# Create a scatter plot with larger, semi-transparent markers
plt.scatter(x, y, color='blue', alpha=0.5, s=80)

# Customize the plot
plt.title('Scatter Plot', fontsize=16, fontweight='bold')
plt.xlabel('X', fontsize=12)
plt.ylabel('Y', fontsize=12)
plt.grid(True, linestyle='--', linewidth=0.5, alpha=0.5)


# Adjust the spacing and layout
plt.tight_layout()

In [None]:
# Generate random data from a normal distribution
np.random.seed(0)
x = np.random.normal(loc=0, scale=1, size=100)
y = np.random.normal(loc=0, scale=1, size=100)

# Add a per-point transparency gradient (array-valued alpha needs Matplotlib 3.4+)
alpha = np.linspace(0.2, 1, len(x))

# Create a scatter plot
plt.scatter(x, y, color='blue', alpha=alpha, s=80)

# Customize the plot
plt.title('Scatter Plot', fontsize=16, fontweight='bold')
plt.xlabel('X', fontsize=12)
plt.ylabel('Y', fontsize=12)
plt.grid(True, linestyle='--', linewidth=0.5, alpha=0.5)


# Adjust the spacing and layout
plt.tight_layout()

In [None]:
# Set figure size
plt.figure(figsize=(8, 6))

# Generate random data from a normal distribution
np.random.seed(0)
x = np.random.normal(loc=0, scale=1, size=100)
y = np.random.normal(loc=0, scale=1, size=100)

# Generate a third variable that holds one value per data point
z = np.random.rand(len(x))

# Set marker color based on z
plt.scatter(x, y, c=z, cmap='coolwarm', s=80)

# Add colorbar
cbar = plt.colorbar()
cbar.set_label('Color')

# Customize the plot
plt.title('Scatter Plot', fontsize=16, fontweight='bold')
plt.xlabel('X', fontsize=12)
plt.ylabel('Y', fontsize=12)
plt.grid(True, linestyle='--', linewidth=0.5, alpha=0.5)

# Adjust the spacing and layout
plt.tight_layout()


In [None]:
# Set figure size
plt.figure(figsize=(8, 6))

# Generate random data from a normal distribution
np.random.seed(0)
x = np.random.normal(loc=0, scale=1, size=100)
y = np.random.normal(loc=0, scale=1, size=100)

# Generate a third variable that holds one value per data point
z = np.random.rand(len(x))

# Set marker color and size based on z
plt.scatter(x, y, c=z, cmap='coolwarm', s=100*z)

# Add colorbar
cbar = plt.colorbar()
cbar.set_label('Color')

# Customize the plot
plt.title('Scatter Plot', fontsize=16, fontweight='bold')
plt.xlabel('X', fontsize=12)
plt.ylabel('Y', fontsize=12)
plt.grid(True, linestyle='--', linewidth=0.5, alpha=0.5)

# Adjust the spacing and layout
plt.tight_layout()


In [None]:
# Set figure size
plt.figure(figsize=(8, 6))

# Generate random data from a normal distribution
np.random.seed(0)
x = np.random.normal(loc=0, scale=1, size=100)
y = np.random.normal(loc=0, scale=1, size=100)

# Generate a third variable that holds one value per data point
z = np.random.rand(len(x))

# Create a scatter plot using Seaborn
sns.scatterplot(x=x, y=y, hue=z, palette='coolwarm', size=z, sizes=(50, 500), zorder=2)


# Customize the plot
plt.title('Scatter Plot', fontsize=16, fontweight='bold')
plt.xlabel('X', fontsize=12)
plt.ylabel('Y', fontsize=12)
plt.grid(True, linestyle='--', linewidth=0.5, alpha=0.5)

# Adjust the spacing and layout
plt.tight_layout()

#### For Iris dataset

In [None]:
# Load the Iris dataset
iris = sns.load_dataset('iris')

# Create a scatter plot using Seaborn
sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=iris, palette='Set2')

# Customize the plot
plt.title('Iris Dataset - Sepal Length vs Sepal Width')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')

## Pair Plot

1. The pairplot function takes a DataFrame as input and produces a grid of subplots, with each subplot showing the scatter plot between two variables.
2. The diagonal subplots of the pairplot display the distribution of each variable separately.
3. The off-diagonal subplots show the scatter plots between pairs of variables, allowing for the examination of the relationships between different combinations of variables.
4. By specifying additional parameters, such as the 'hue' parameter, the pairplot can be used to differentiate the data points based on a categorical variable. This allows for the visualization of different groups or categories in the dataset.
5. Pairplots are particularly useful in exploratory data analysis as they provide a comprehensive overview of the relationships between multiple variables in a dataset.
6. They can help identify patterns, correlations, and potential outliers in the data.
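
To make the grid layout described above concrete, here is a minimal Matplotlib-only sketch of the same idea: distributions on the diagonal and scatter plots everywhere else. It is not how seaborn implements `pairplot`. The synthetic data and the column names `a`, `b`, and `c` are made up for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical data: 'b' depends on 'a', while 'c' is independent noise
a = rng.normal(size=100)
data = {
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=100),
    "c": rng.normal(size=100),
}
names = list(data)
n = len(names)

fig, axes = plt.subplots(n, n, figsize=(7, 7))
for i, row_var in enumerate(names):
    for j, col_var in enumerate(names):
        ax = axes[i, j]
        if i == j:
            # Diagonal: distribution of a single variable
            ax.hist(data[row_var], bins=15, color="steelblue")
        else:
            # Off-diagonal: scatter plot of one variable against another
            ax.scatter(data[col_var], data[row_var], s=10, alpha=0.6)
        if i == n - 1:
            ax.set_xlabel(col_var)
        if j == 0:
            ax.set_ylabel(row_var)

fig.tight_layout()
```

In practice `sns.pairplot` does all of this in one call and adds shared axes, hue handling, and a legend.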

In [None]:
iris = sns.load_dataset('iris')
iris

In [None]:
# Load the Iris dataset
iris = sns.load_dataset('iris')

# Create a pair plot of all numeric columns, colored by species
sns.pairplot(iris, hue='species', palette='Set2', markers=["o", "s", "D"])


## PairGrid

1. The PairGrid class in seaborn provides a flexible way to create a grid of subplots for pairwise relationships between variables in a dataset.
2. Unlike pairplot, PairGrid gives you more control over the layout and customization of the plots.
3. With PairGrid, you can specify different types of plots for the upper triangle, lower triangle, and diagonal cells of the grid.
4. The upper triangle typically shows scatter plots or other types of plots that visualize the relationship between two variables.
5. The lower triangle can display different types of plots, such as scatter plots, heatmaps, or kernel density estimate (KDE) plots, for visualizing the distribution or density of data.
6. The diagonal cells are usually used to plot histograms or other univariate plots for each variable separately.
7. PairGrid allows you to map different plotting functions to the different parts of the grid, giving you the flexibility to choose the most appropriate plot type for each position.

In [None]:
# Load the Iris dataset
iris = sns.load_dataset('iris')

# Create a PairGrid
g = sns.PairGrid(iris, hue='species', palette='Set2')

# Scatterplot for all cells
g.map(sns.scatterplot)

# Add a legend (uncomment the next line to show it)
# g.add_legend()

# Customize the plot
g.figure.suptitle('Iris Dataset - PairGrid', fontsize=16, fontweight='bold')
g.figure.tight_layout()


In [None]:
# Load the Iris dataset
iris = sns.load_dataset('iris')

# Create a PairGrid
g = sns.PairGrid(iris, hue='species', palette='Set2')

# Histograms on the diagonal
g.map_diag(sns.histplot, kde=True)

# Scatterplots on the off-diagonal cells
g.map_offdiag(sns.scatterplot)

# Add a legend
# g.add_legend()

# Customize the plot
g.figure.suptitle('Iris Dataset - PairGrid', fontsize=16, fontweight='bold')
g.figure.tight_layout()

In [None]:
# Load the Iris dataset
iris = sns.load_dataset('iris')

# Create a PairGrid
g = sns.PairGrid(iris, hue='species', palette='Set2')

# Scatterplot in the upper triangle
g.map_upper(sns.scatterplot)

# Kernel density estimate plot in the lower triangle
g.map_lower(sns.kdeplot)

# Histograms on the diagonal
g.map_diag(sns.histplot, kde=True)

# Add a legend (uncomment the next line to show it)
# g.add_legend()

# Customize the plot
g.figure.suptitle('Iris Dataset - PairGrid', fontsize=16, fontweight='bold')
g.figure.tight_layout()

In [None]:
# Load the Iris dataset
iris = sns.load_dataset('iris')

# Create a PairGrid
g = sns.PairGrid(iris, hue='species', palette='Set2', diag_sharey=False, corner=True)

# Scatterplots on the lower triangle
g.map_lower(sns.scatterplot)

# KDE plots on the diagonal
g.map_diag(sns.kdeplot)

# Add a legend
g.add_legend()

# Customize the plot
g.figure.suptitle('Iris Dataset - PairGrid', fontsize=16, fontweight='bold')
g.figure.tight_layout()

In [None]:
pip install ptitprince

PtitPrince raincloud plot tutorial: https://github.com/pog87/PtitPrince/blob/master/tutorial_python/raincloud_tutorial_python.ipynb

In [None]:
pip install seaborn==0.11

In [None]:
import ptitprince as pt
import seaborn as sns
iris = sns.load_dataset('iris')

In [None]:
ort = "h"; pal = "Set2"; sigma = .2
f, ax = plt.subplots(figsize=(7, 5))

pt.RainCloud(x = 'species', y = 'sepal_width', hue='species', data = iris, palette = pal, bw = sigma,
                 width_viol = .6, ax = ax, orient = ort, move = .2)

In [None]:
ort = "v"; pal = "Set2"; sigma = .2
f, ax = plt.subplots(figsize=(7, 5))

pt.RainCloud(x = 'species', y = 'sepal_width', hue='species',data = iris, palette = pal, bw = sigma,
                 width_viol = .6, ax = ax, orient = ort, move = .2)

In [None]:
ort = "h"; pal = "Set2"; sigma = .15
f, ax = plt.subplots(figsize=(7, 5))

pt.RainCloud(x = 'species', y = 'sepal_width', hue='species', data = iris, palette = pal, bw = sigma,
                 width_viol = .6, ax = ax, orient = ort, move = .2)

In [None]:
ort = "h"; pal = "Set2"; sigma = .2
f, ax = plt.subplots(figsize=(7, 5))

pt.RainCloud(x = 'species', y = 'sepal_width', data = iris, palette = pal, bw = sigma,
                 width_viol = .6, ax = ax, orient = ort, move = .2,  pointplot = True)

In [None]:
ort = "v"; pal = "Set2"; sigma = .2
f, ax = plt.subplots(figsize=(7, 5))

pt.RainCloud(x = 'species', y = 'sepal_width', data = iris, palette = pal, bw = sigma,
                 width_viol = .6, ax = ax, orient = ort, move = .2,  pointplot = True)