# Univariate Visualizations

# Univariate visualizations help you understand each attribute in a dataset on its own. In this section, we will look at techniques for understanding the distribution of each attribute.

# The distribution of a variable describes how often each of its values occurs. For example, if we have a variable "age", then the distribution of age is the frequency of each age value in the data.
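
# As a minimal sketch of this idea, the frequency of each value can be tallied with `collections.Counter` from the Python standard library. The `ages` list is made-up illustration data and does not come from any dataset used in this chapter.

```python
from collections import Counter

# Hypothetical ages, invented for illustration only
ages = [21, 25, 21, 30, 25, 21]

# The distribution: how many times each distinct value occurs
distribution = Counter(ages)

for age, count in sorted(distribution.items()):
    print(f"age {age}: {count} observation(s)")
```

# A histogram or bar chart is a picture of exactly this kind of tally.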

# We will discuss distributions in much greater detail when we look at probability and statistics in a future chapter. For now, we will look at techniques that you can use to understand the distribution of your attributes.

# We will use the Pima Indians Diabetes dataset to demonstrate the univariate visualizations. It poses a binary classification problem in which all of the attributes are numeric and have different scales. It is also a good dataset for practicing with neural networks because all of the inputs are small numeric values. Later examples also draw on the Boston Housing, Adult and IMDb top 1000 movies datasets.

# We will look at the following techniques in this section:

# - Histograms.
# - Density Plots.
# - Box and Whisker Plots.

# ## Histogram: Ordinal Variables

# Histograms group data into bins and give you a count of the number of observations in each bin. From the shape of the histogram you can quickly get a feeling for whether an attribute is Gaussian, skewed or even exponentially distributed. Histograms can also help you spot possible outliers.
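
# To make the binning step concrete, here is a small sketch in plain Python. The helper `histogram_counts`, the `bmi_values` list and the bin edges are all hypothetical and chosen only for illustration. The sketch assumes half-open bins, with the last bin also including its right edge.

```python
def histogram_counts(values, bin_edges):
    """Count how many values fall into each bin defined by consecutive edges."""
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for value in values:
        for i in range(len(counts)):
            low, high = bin_edges[i], bin_edges[i + 1]
            # Bins are half-open [low, high); the last bin also includes its right edge
            if low <= value < high or (i == last and value == high):
                counts[i] += 1
                break
    return counts


# Hypothetical BMI values, invented for illustration only
bmi_values = [18.5, 22.0, 24.9, 27.3, 31.0, 33.5, 35.0, 40.0]
edges = [10, 20, 30, 40]

counts = histogram_counts(bmi_values, edges)
for low, high, count in zip(edges, edges[1:], counts):
    print(f"{low}-{high}: {count}")
```

# Plotting libraries do this counting for you. The sketch only shows what the height of each bar represents.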

# For a nominal or categorical variable, you can create a bar chart showing the count of each class value.

# Distributions often take on one of the following shapes:

# - Bell
# - Long tail
# - Skewed
# - Twin peaked
# - Flat

# ![Shapes](https://cdn-comlp.nitrocdn.com/jHZYXVQcGrlgKuxnSthsdCgKBQAsIMJC/assets/images/optimized/rev-bfec2c5/wp-content/uploads/2022/10/Different-Type-of-Data-Distributions-1024x577.jpg)

# The shape of a distribution can help you select the most suitable plots, and it can also suggest that transforming your data may be useful prior to modeling.
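
# As one illustration of how a shape can suggest a transform, the sketch below measures the skewness of a long-tailed sample before and after taking logarithms. The `skewness` helper (the standardized third moment) and the `raw` values are assumptions made for this example. A log transform is only one of several possible choices.

```python
import math


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical long-tailed values (each one doubles the previous)
raw = [1, 2, 4, 8, 16, 32, 64, 128]

# Taking logs turns the doubling steps into equal steps
logged = [math.log(v) for v in raw]

print("skewness before:", round(skewness(raw), 2))
print("skewness after: ", round(skewness(logged), 2))
```

# A positive skewness indicates a long right tail. After the transform the values are evenly spaced, so the skewness is essentially zero.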

# The example below creates a histogram plot of the diabetes dataset input variables.

# Running the example creates a histogram plot for each numeric input variable in the dataset. We can see that some attributes may have an exponential distribution, such as attributes 3 and 6.

# We can also see that other attributes may have a skewed distribution such as attributes 1, 5 and 7.

# Finally, a few attributes appear to have a Gaussian or nearly Gaussian distribution, such as attributes 0, 4 and 8.



# ## Imports 

# We will use the following libraries:

# - Pandas: to load the data.
# - Matplotlib: to create figures and set titles, labels and axes.
# - Seaborn: to draw the statistical plots (histograms, box plots and violin plots).

# In order to use pandas we need to import it:

# ```python
# import pandas as pd
# ```

# It is common to import pandas as pd. This is not required, but it is a convention that you will see used in other code and documentation, and in these tutorials.

# We will use the `read_csv()` function to load the data. This is a function in pandas that will load data from a CSV file into a DataFrame object.

# ```python
# dataset = pd.read_csv('pima-indians-diabetes.csv')
# ```


# import pandas as pd
# from matplotlib import pyplot as plt
# import seaborn as sns
# %matplotlib inline

# # Set the seaborn theme to the "ticks" style
# sns.set_theme(style="ticks", color_codes=True)

# The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The following table describes the dataset columns:

# | Column | Description |
# | --- | --- |
# | CRIM | per capita crime rate by town |
# | ZN | proportion of residential land zoned for lots over 25,000 sq. ft. |
# | INDUS | proportion of non-retail business acres per town |
# | CHAS | Charles River dummy variable (1 if tract bounds river; 0 otherwise) |
# | NOX | nitric oxides concentration (parts per 10 million) |
# | RM | average number of rooms per dwelling |
# | AGE | proportion of owner-occupied units built prior to 1940 |
# | DIS | weighted distances to five Boston employment centres |
# | RAD | index of accessibility to radial highways |
# | TAX | full-value property-tax rate per \$10,000 |
# | PTRATIO | pupil-teacher ratio by town |
# | B | $1000(B_k - 0.63)^2$ where $B_k$ is the proportion of black population by town |
# | LSTAT | % lower status of the population |
# | MEDV | median value of owner-occupied homes in \$1000's |

# """ Pima Indians Diabetes Database """

# # data = pd.read_csv('../data/pima-indians-diabetes.csv')
# # 
# # """ Boston Housing Dataset """

# column_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
# data = pd.read_csv('../data/housing.csv.xls', header=None, delimiter=r"\s+", names=column_names)
# fig, axs = plt.subplots(ncols=1, nrows=len(column_names), figsize=(5, 50))

# for i, col in enumerate(column_names):
#     sns.histplot(data=data[col], ax=axs[i])
    
# """ Adult Dataset """

# data = pd.read_csv('../data/adult.data', header=None)
# data.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', \
#     'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']

# fig, axs = plt.subplots(ncols=1, nrows=len(data.columns), figsize=(5, 50))
# for i, col in enumerate(data.columns):
#     sns.histplot(data=data[col], ax=axs[i])
    
# By default, `sns.histplot` chooses the bin edges automatically from the data. However, it is possible to set the bin edges explicitly. This is useful when comparing multiple distributions, because every histogram then uses the same bins.

# ```python  
# sns.histplot(data=dataset, x="age", bins=range(0, 100, 5))
# ```

# # Histogram of BMI
# sns.histplot(data, x="BMI", bins=10);
# plt.title('BMI distribution');
# plt.ylabel('Count (of rows)');
# plt.grid();



# # Pairwise scatter plots coloured by Outcome, with a density estimate of each attribute on the diagonal
# sns.pairplot(data, hue='Outcome', diag_kind='kde');

# # Stacked histogram of BMI, split by Outcome
# sns.histplot(data, x="BMI", hue="Outcome", multiple="stack");
# plt.title('BMI distribution by Outcome');

# from matplotlib import pyplot as plt
# import seaborn as sns

# sns.set_theme()

# # sns.histplot returns a Matplotlib Axes object, so we name it ax
# ax = sns.histplot(data['IMDB_Rating']);
# ax.set(title='');

# ## Histogram: Nominal Variables

# # For a nominal variable, histplot draws one bar per category
# ax = sns.histplot(data['Certificate'])
# plt.xticks(rotation=90);

# ## Title

# fig = sns.histplot(data['Certificate'])
# plt.xticks(rotation=90);
# plt.title("Count of each Certificate in top 1000 IMDb movies");

# ## Legend

# # Split the data into the first and last 500 rows
# top    = data[:500]
# bottom = data[500:]

# sns.histplot(top['Certificate'], label='Top 500', alpha=0.5)
# sns.histplot(bottom['Certificate'], label='Bottom 500', color='purple', alpha=0.5)
# plt.xticks(rotation=90);
# plt.title("Count of each Certificate in top 1000 IMDb movies")
# plt.legend();

# ## Boxplot

# sns.boxplot(x=data["Meta_score"]);
# plt.title('Ratings on Metacritic for top 1000 IMDb movies');
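
# A box plot draws the quartiles of a variable as a box and flags points far from the box as outliers. The sketch below computes those quantities by hand with the standard library. The `scores` list is made-up illustration data, and the sketch assumes the common 1.5 × IQR whisker rule, which is the default in seaborn and Matplotlib.

```python
import statistics

# Hypothetical scores, invented for illustration only
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# Quartiles: the edges of the box and the line inside it
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")

# Interquartile range: the height of the box
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower_fence or s > upper_fence]

print("Q1, median, Q3:", q1, median, q3)
print("outliers:", outliers)
```

# Here only the value 30 lies beyond the upper fence, so it is the single point a box plot would mark as an outlier.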

# ## Violin Plot

# sns.violinplot(x=data["Meta_score"]);
# plt.title('Ratings on Metacritic for top 1000 IMDb movies');

# ## Subplots

# fig, axes = plt.subplots(2, 2, figsize=(10, 10))

# sns.histplot(data['IMDB_Rating'], ax=axes[0][0]);
# sns.histplot(data['Meta_score'], ax=axes[0][1]);
# sns.boxplot(x=data['IMDB_Rating'], ax=axes[1][0]);
# sns.boxplot(x=data['Meta_score'], ax=axes[1][1]);

# axes[0][0].set_title('IMDb Rating', fontsize=20);
# axes[0][1].set_title('Meta score', fontsize=20);