<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br><br><br>
<h1>Python for Business Analytics</h1>
<em>A Nontechnical Approach for Nontechnical People</em><br><br>
<em><strong>Custom Edition for Hult International Business School</strong></em><br>

Written by Chase Kusterer - Faculty of Analytics <br>
Hult International Business School <br>
<a href="https://github.com/chase-kusterer">https://github.com/chase-kusterer</a>
<br><br><br><br><br>
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br>
<h1>Chapter 13: Method Chaining and Missing Values</h1>

Analysts and data scientists spend a considerable amount of time cleaning data. This includes aspects that were covered in <strong>Chapter 12: Analyzing DataFrame Quality</strong>, and extends into the concepts of this chapter. Here, we will expand on our analysis of DataFrame quality from the perspective of missing values. Missing values plague nearly every quantitative analysis, and a vast number of techniques exist in order to address these anomalies. This is a topic worth exploring in immense detail, primarily because:

<br><div align="center">
<strong>If a value is truly missing, we have no way of knowing what its exact value should be.</strong>
    <a class="tocSkip"></a></div><br>

Take, for example, dinosaurs. If you had to draw a picture of your favorite dinosaur, what colors would you use? Odds are you would choose colors such as brown and green. However, any colors you pick would be valid. Given that dinosaurs are extinct, are you absolutely sure that dinosaurs weren't pink or purple? We have no way of knowing with 100% confidence, and our job as analysts is to use available information to make our best guess.
<br><br>
Missing values cause another perplexing challenge:

<br><div align="center">
<strong>Since we do not know their exact values, we have no way to audit the results of <a href="https://www.investopedia.com/terms/i/imputed-value.asp">imputation</a> strategies.</strong>
    <a class="tocSkip"></a></div><br>
    
Countless techniques have been developed to address missing values, with the most common presented below:

1. Drop them from the dataset.
2. Engineer flag features.
3. Impute using a measure of center.
4. Develop an algorithm based on available information.

<br>
<h4>Dropping Missing Values</h4>
Considering that we have no way of knowing what their exact values should be and are unable to audit the results of an imputation strategy, it may seem wise to simply drop missing values from the dataset. Although this seems like a safe bet, <font style="color:red"><strong>this strategy should only be used as a last resort.</strong></font> The following exemplifies the rationale for this:

<br><br><div style = "width:image width px; font-size:80%; text-align:center;"><img src="./__images/chapter-13-diamonds-mv-example.png" width="250" height="200" style="padding-bottom:0.5em;"> <em>Figure 13.1: Missing value example.</em></div>

<br>
<em>Figure 13.1</em> shows a subset from the <em>diamonds</em> dataset where the values for <em>carat</em> are missing (represented with <em>NaN</em>, which stands for not a number). If missing values were dropped from the dataset, all other information for these observations would also be removed. In other words, we would lose the respective values for <em>price</em>, <em>color</em>, <em>clarity</em>, and <em>cut</em>. Such a practice can cause severe distortions, preventing the data from speaking for itself.
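<br><br>
The following is a minimal sketch of this problem. It uses a small, made-up DataFrame (not the actual <em>diamonds</em> dataset) to show that dropping missing values for <em>carat</em> also removes perfectly valid values in the other columns.

```python
import numpy as np
import pandas as pd

# hypothetical mini-dataset (values are made up for illustration)
toy = pd.DataFrame({"carat" : [0.50, np.nan, 1.10, np.nan],
                    "price" : [1500, 2100, 6200, 3400],
                    "cut"   : ["Ideal", "Good", "Premium", "Ideal"]})

# dropping every row that contains at least one missing value
toy_dropped = toy.dropna()

# the prices 2100 and 3400 are removed, even though they were not missing
print(toy_dropped)
print(f"Rows before: {len(toy)} | Rows after: {len(toy_dropped)}")
```

Half of the observations disappear in this toy example, along with their known values for <em>price</em> and <em>cut</em>.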
<br><br>
<br>
<h4>Engineering Flag Features</h4>
<strong>Feature engineering</strong> is the process of developing new features based on discoveries found in the original data. One often-overlooked area of feature engineering is the development of <strong>missing value flags</strong>. A missing value flag is a feature made of ones and zeros, with one representing an original missing value. In other words, a missing value flag is a new column of data based on whether or not the data in an existing column is missing (1 == True, 0 == False). These can be very useful to preserve the integrity of the original dataset and can even be used in predictive modeling. Also, it is important to keep in mind that:

<br><div align="center">
<strong>The reason behind why a value is missing may be insightful.</strong>
    <a class="tocSkip"></a></div><br>

A good example comes from an analysis I conducted on a baseball dataset, where several values for the <em>Hit By Pitch</em> column were missing. As its name implies, a <strong>hit by pitch (HBP)</strong> occurs when a batter gets hit by a pitch, resulting in the batter getting a free base. I decided to impute these missing values with the mean or median and move forward with my analysis. Later that evening, I started reading the history of baseball and noticed a pattern to the missing values: In the early days of baseball, a hit by pitch was considered a ball, not a free base. For this reason, HBP was not recorded, leading to missing values. In later years, the HBP rule was implemented, implying that the missing values for this feature represent the transition to a new era of baseball. For this reason, I decided to flag these missing values as they represented a significant event.
<br><br>
I also discovered that before the HBP rule, baseball was played in a completely different way. For instance, the early days of baseball were much less competitive. Batters would shout out the kind of pitch they wanted and the pitcher would comply. The alleged rationale for this was that fans preferred higher scoring games. Domain knowledge is key, and techniques such as flagging missing values can help make use of domain knowledge.
<br><br>
<h4>Imputing with a Measure of Center</h4>
Simply put, when an observation is unknown and there is no other available information, imputing with a measure of center (mean, median, etc.) is our most pragmatic approach, assuming a normal distribution. In such cases, statistics is our best friend. Given enough samples, the sample mean converges on the true value of the population mean. This implies that imputing with a measure of center minimizes the expected error between the imputed value and the true (unknown) value of the missing data. More specifically, the mean minimizes the average squared error, while the median minimizes the average absolute error. In other words, we will be less wrong if we impute with a measure of center than if we impute with any other value.
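<br><br>
One way to build intuition for this is to try many candidate fill-in values and measure how wrong each one is on average. The sketch below uses made-up numbers and a simple grid of candidates (both are assumptions for illustration). It relies on two well-known facts: the mean minimizes the average squared error, and the median minimizes the average absolute error.

```python
import numpy as np

# hypothetical observed values (made up for illustration)
observed = np.array([0.3, 0.4, 0.5, 0.7, 1.0, 1.2, 2.5])

# candidate fill-in values to compare (a grid with steps of 0.01)
candidates = np.linspace(start = observed.min(),
                         stop  = observed.max(),
                         num   = 221)

# average squared error and average absolute error for each candidate
sq_error  = [np.mean((observed - c) ** 2) for c in candidates]
abs_error = [np.mean(np.abs(observed - c)) for c in candidates]

# candidates with the smallest errors
best_sq  = candidates[np.argmin(sq_error)]
best_abs = candidates[np.argmin(abs_error)]

print(f"Mean  : {observed.mean():.2f} | Best (squared error) : {best_sq:.2f}")
print(f"Median: {np.median(observed):.2f} | Best (absolute error): {best_abs:.2f}")
```

The best candidates land next to the mean and the median, respectively, which is why these two measures of center are the usual starting points for simple imputation.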
<br><br>
<h4>Developing an Algorithm based on Available Information</h4>
This strategy becomes more of a focal point after developing a solid understanding of predictive modeling and machine learning. For now, note that given additional information (highly-correlated features, categorical data, etc.), algorithms can be developed to make (potentially) better imputations for missing data. For example, if we were working with country-level data from <a href="https://www.worldbank.org/en/home">The World Bank</a> and encountered missing values, we may achieve better results by categorizing countries based on their level of development before imputation.
<br><br>
Additional information on algorithms for missing value imputation can be found in <a href="https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute">the official scikit-learn documentation</a>.
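<br><br>
As a rough sketch of the country-level idea above, the following code imputes each missing value with the median of its own development group. The DataFrame, its column names (<em>development</em>, <em>gdp_pc</em>), and its values are all hypothetical and are not taken from The World Bank.

```python
import numpy as np
import pandas as pd

# hypothetical country-level data (names and values are made up)
countries = pd.DataFrame({
    "country"     : ["A", "B", "C", "D", "E", "F"],
    "development" : ["high", "high", "high", "low", "low", "low"],
    "gdp_pc"      : [52000, np.nan, 48000, 1900, 2300, np.nan]})

# median of each development group, aligned to the original rows
group_median = countries.groupby("development")["gdp_pc"].transform("median")

# filling each missing value with the median of its own group
countries["gdp_pc_imputed"] = countries["gdp_pc"].fillna(group_median)

print(countries)
```

Had we used the overall median instead, both missing countries would have received the same value, regardless of their level of development.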

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part I: Getting Started with Method Chaining</h2><br>
As mentioned in <strong>Chapter 02: Printing, Dynamic Strings, and Escape Sequences</strong>, <strong>method chaining</strong> is the process of linking methods together using dots. In Jupyter Notebook, all methods available for a given object can be accessed by pressing <em>tab</em> on your keyboard.
<br><br>
Method chaining is a very useful technique for missing value detection. We can also benefit from visualizing the distributions of features affected by missing values with the <em>matplotlib</em> and <em>seaborn</em> packages. Let's begin by importing these, as well as the <em>diamonds</em> dataset.
<br>
<h4>Practice - Importing packages and data.</h4>
Import <strong>pandas</strong> as <strong>pd</strong>. Then, complete the code to adjust data types and import <strong>diamonds_missing_values.xlsx</strong> as <em>diamonds</em>.

In [None]:
# Code 13.1.1

# importing packages
import ____ as pd # data science essentials
import matplotlib.pyplot as plt # NEW: data visualization essentials
import seaborn as sns # NEW: enhanced data visualization


# converting data types with a dictionary
data_types = {"Obs"     : ____,
              "channel" : ____,
              "store"   : ____}


# specifying a file (must specify path to datasets folder)
file = ____


# reading the file into Python through pandas
diamonds = ____


# printing the first 5 rows of the dataset
print(diamonds.head(n = 5))


In [None]:
# Sample Solution 13.1.1

# importing packages
import pandas as pd # data science essentials
import matplotlib.pyplot as plt # NEW: data visualization essentials
import seaborn as sns # NEW: enhanced data visualization


# converting data types with a dictionary
data_types = {"Obs"     : str,
              "channel" : str,
              "store"   : str}


# specifying a file (must specify path to datasets folder)
file = './__datasets/diamonds_missing_values.xlsx'


# reading the file into Python through pandas
diamonds = pd.read_excel(io         = file,
                         sheet_name = 'missing_diamonds',
                         header     = 0,
                         dtype      = data_types)


# checking the first 5 rows of the dataset
diamonds.head(n = 5)


<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br><br>
The following is an example of a method chain that returns <em>True</em> or <em>False</em>, depending on whether or not a feature contains missing values. It can be interpreted as follows:
<br>

* Take the diamonds dataset,
* <strong>and then</strong> check to see if a value <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html">isnull(&nbsp;)</a>,
* <strong>and then</strong>, aggregate to display whether or not <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.any.html">any(&nbsp;)</a> null values exist per feature.

In [None]:
# Code 13.1.2

# method chaining!
print(diamonds.isnull().any(axis = 0))


<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

At this point in your Python journey, method chaining may seem confusing. You are very likely to encounter method chains in sample codes that you find online, so it is very important to understand what Python is trying to do when encountering syntax such as in the code above. When in doubt, a good strategy is to break the chain down step by step.
<br><br>
<strong>Step 1:</strong> Printing the dataset.

In [None]:
# Code 13.1.3 (a)

# Step 1
print(diamonds)


<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>Step 2:</strong> Taking the diamonds dataset, <strong>and then</strong> checking for null values with <em>isnull(&nbsp;)</em>.

In [None]:
# Code 13.1.3 (b)

# Step 2
print(diamonds.isnull())


<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<strong>Step 3:</strong> Taking the results of <em>isnull(&nbsp;)</em>, <strong>and then</strong> aggregating column-wise with <em>any(&nbsp;)</em> to see which features are affected by missing values.

In [None]:
# Code 13.1.3 (c)

# Step 3
print(diamonds.isnull().any(axis = 0))


<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

The thinking behind developing a method chain also starts in a step-by-step fashion. Experience is key, and getting started can be a challenge. It is important to keep in mind what you are trying to achieve and conduct research on methods as necessary. Try the following exercise for practice.
<br><br>
<h4>Practice - a) Take the diamonds dataset.</h4>
Make sure to include a <em>print(&nbsp;)</em> wrapper.

In [None]:
# Code 13.1.4 (a)

# taking the diamonds dataset
____


In [None]:
# Sample Solution 13.1.4 (a)

# taking the diamonds dataset
print(diamonds)


<br>
<h4>Practice - b) Extend the method chain to output <em>True</em> or <em>False</em> based on whether or not a data point is missing.</h4>
Make sure to include a <em>print(&nbsp;)</em> wrapper.

In [None]:
# Code 13.1.4 (b)

# checking each data point for missing values
____


In [None]:
# Sample Solution 13.1.4 (b)

# checking each data point for missing values
print(diamonds.isnull())


<br>
<h4>Practice - c) Extend the method chain to <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html">sum(&nbsp;)</a> missing values per feature.</h4>
Make sure to include a <em>print(&nbsp;)</em> wrapper.

In [None]:
# Code 13.1.4 (c)

# summing missing values per feature
____


In [None]:
# Sample Solution 13.1.4 (c)

# summing missing values per feature
print(diamonds.isnull().sum(axis = 0))


<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br>
<h3>The Rare Event Rule</h3><br>
The rare event rule is a statistical heuristic that can be summed up as follows:
<br><br><br>
<div align="center"><strong>
    If a phenomenon occurs less than 5% of the time, it is considered rare and is unlikely to occur.
    </strong><a class="tocSkip"></a></div><br><br>
In terms of missing values, this rule is often interpreted as:
<br><br><br>
<div align="center"><strong>
    If missing values affect less than 5% of observations, then they are not a problem worthy of heavy focus in our analysis.
    </strong><a class="tocSkip"></a></div><br><br>
In other words, if missing values are rare, then we can likely remove them or use a simple method to fill in their values without significantly affecting the overall distribution of each feature. Before moving forward, <strong><font style="color:red">note that this is a dangerous heuristic that we will not be following.</font></strong> This is primarily driven by two factors:

1. The heuristic used above can be misleading in terms of the actual number of observations being affected by missing values.
2. Tremendous value can be generated from missing values, not only in terms of domain knowledge, but also in terms of prediction.
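
<br>
The first factor can be illustrated with a small sketch. The DataFrame below is made up for illustration: each feature is missing only 4% of its values, yet because the missing values fall in different rows, a much larger share of observations is affected.

```python
import numpy as np
import pandas as pd

# hypothetical dataset: 100 observations, 4 features (made up for illustration)
toy = pd.DataFrame(data    = np.ones(shape = (100, 4)),
                   columns = ["carat", "color", "clarity", "cut"])

# each feature is missing in 4 DIFFERENT rows (4% per feature)
toy.loc[  0:3  , "carat"]   = np.nan
toy.loc[ 10:13 , "color"]   = np.nan
toy.loc[ 20:23 , "clarity"] = np.nan
toy.loc[ 30:33 , "cut"]     = np.nan

# percentage of missing values per feature
pct_per_feature = toy.isnull().mean(axis = 0) * 100

# percentage of observations with at least one missing value
pct_rows_affected = toy.isnull().any(axis = 1).mean() * 100

print(pct_per_feature.round(1))
print(f"Observations affected: {pct_rows_affected:.1f}%")
```

Every feature passes the 5% heuristic on its own, but dropping missing values would remove 16% of the observations in this example.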

<br>
To begin exploring this, let's flag each data point where a missing value is present.

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br>
<h2>Part II: Flagging Missing Values</h2>

Our objective in flagging missing values is to create a column of ones and zeros, where one represents that a data point was originally missing and zero the opposite. We can utilize method chaining to accomplish this task.

<h4>Practice - Develop a method chain that:</h4>

* Takes the <em>carat</em> column
* Checks each data point for missing values (True or False)
* Converts the True/False data to integers.

<br><br>
<strong><u>Tips</u></strong>

1. Remember that you can build a method chain step by step and check your results along the way.
2. Don't forget that you can access all available methods for an object by pressing the <em>Tab</em> key right after a dot.

In [None]:
# Code 13.2.1

# instantiating a missing value flag for carat
diamonds['m_carat'] = ____


In [None]:
# Sample Solution 13.2.1

# instantiating a missing value flag for carat
diamonds['m_carat'] = diamonds.loc[ : , 'carat'].isnull().astype(int)


<br>

In [None]:
# Code 13.2.2

print(f"""
Original Missing Value Counts:
------------------------------
{diamonds.loc[ : , 'carat'].isnull().sum(axis = 0)}


Sums of Missing Value Flags
--------------------------
{diamonds.loc[ : , 'm_carat' ].sum(axis = 0)}

""")


<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br>
<h3>Creating New Columns in a DataFrame</h3><br>
It appears that all missing values are accounted for by the flag features. Note that the code above is creating new columns directly in the dataset to store the flag features. In general, the syntax:
<br><br>

~~~
diamonds['m_carat'] = diamonds['carat'].isnull().astype(int)
~~~

<br>
Can be generalized as follows:
<br><br>

~~~
DATAFRAME['NEW_COLUMN_NAME'] = DATA TO BE POPULATED INTO THE NEW COLUMN
~~~

<br>
The new columns will be attached to the end of the DataFrame, allowing us to use a combination of <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html">columns</a> and <strong>.iloc[ ]</strong> to access their data.

<br><br>
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />

In [None]:
# Code 13.2.3

diamonds.columns


<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br>
<h3>Special Gift: Automatically Flagging Missing Values</h3><br>
When working with larger datasets, it may be infeasible to hard code each flag feature. Thus, the following loop template has been developed in order to improve your efficiency. This loop took several hours to code when it was originally developed, and since then, I have been able to apply it to countless analysis projects. Now, it is my gift to you. Welcome to open source!<br><br>

~~~
# developing a loop to automatically flag missing values
for col in DATAFRAME:
    
    if DATAFRAME[col].isnull().astype(int).sum() > 0:
        DATAFRAME['m_'+col] = DATAFRAME[col].isnull().astype(int)
~~~

<br>
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br>
<h4>Practice - Complete the code so that the loop flags missing values for all features.</h4>

In [None]:
# Code 13.2.4

# soft coding :)
DATAFRAME = ____


# developing a loop to automatically flag missing values
for col in DATAFRAME:

    if DATAFRAME[col].isnull().astype(int).sum() > 0:
        DATAFRAME['m_'+col] = DATAFRAME[col].isnull().astype(int)


# printing results
print(diamonds.columns)


In [None]:
# Sample Solution 13.2.4

# soft coding :)
DATAFRAME = diamonds


# developing a loop to automatically flag missing values
for col in DATAFRAME:

    if DATAFRAME[col].isnull().astype(int).sum() > 0:
        DATAFRAME['m_'+col] = DATAFRAME[col].isnull().astype(int)


# printing results
print(diamonds.columns)


<br>

In [None]:
# Code 13.2.5

print(f"""
Original Missing Value Counts:
------------------------------
{diamonds.isnull().sum(axis = 0)}


Sums of Missing Value Flags
--------------------------
{diamonds.iloc[ : , -4: ].sum(axis = 0)}

""")


<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br>
<h4>Practice - Create a new column that sums the values for each of the missing value flags.</h4>
Our goal here is to better understand how many total observations are affected by missing values.

In [None]:
# Code 13.2.6

# MAKE YOUR COMMENT HERE
diamonds['mv_sum'] = ____


# checking results
print(f"""

Number of Missing Values per Observation (Pct)
----------------------------------------------
{(diamonds['mv_sum'].value_counts(normalize = True,
                                  sort      = True,
                                  ascending = True)*100).round(2)}
""")


In [None]:
# Sample Solution 13.2.6

# creating a column to sum missing value flags
diamonds['mv_sum'] = diamonds['m_carat'] + \
                     diamonds['m_color'] + \
                     diamonds['m_clarity'] + \
                     diamonds['m_cut']


# checking results
print(f"""

Number of Missing Values per Observation (Pct)
----------------------------------------------
{(diamonds['mv_sum'].value_counts(normalize = True,
                                sort = True,
                                ascending = True)*100).round(2)}
""")
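
<br>
The sample solution above adds the four flag columns one by one. When there are many flags, a more general approach may be easier to maintain. The sketch below is one possible alternative, shown on a made-up DataFrame: it assumes that every flag column (and no other column) starts with the prefix <em>m_</em>, and sums across each row.

```python
import pandas as pd

# hypothetical flag columns (made up for illustration)
toy = pd.DataFrame({"price"     : [1500, 2100, 6200],
                    "m_carat"   : [0, 1, 1],
                    "m_color"   : [0, 0, 1],
                    "m_clarity" : [0, 0, 0],
                    "m_cut"     : [0, 1, 0]})

# selecting every column whose name starts with 'm_'
flag_columns = toy.filter(regex = "^m_")

# summing the flags across each row (axis = 1)
toy["mv_sum"] = flag_columns.sum(axis = 1)

print(toy)
```

Note that <em>axis = 1</em> sums across columns for each observation, whereas <em>axis = 0</em> (used earlier in this chapter) sums down each column.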


<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h2>Part III: Imputing Missing Values</h2>

Let's attempt three basic missing value strategies:
1. Imputing with the mean
2. Imputing with the median
3. Dropping missing values from the dataset

<br>
Generally, it is a best practice to develop a distinct missing value strategy for each feature containing such anomalies. Primarily, this is done to minimize the amount of undue variance created through imputation. <strong><font style="color:red">Our goal is to <a href="https://www.lexico.com/en/definition/impute">impute</a> missing values with the approach that distorts the natural distribution of the original data the least.</font></strong> Before performing such a task, however, let's visually analyze each distribution to help determine which strategy will fit the original data the best. In order to do so, we should first drop missing values from the dataset so that they do not cause undue complications when generating plots.
<br><br>
<strong>Note:</strong> There are countless techniques for imputing missing values, and this is a topic that is worth researching in great detail.
<br><br>
<strong>Note:</strong> There is a technique you may encounter called jittering. Although you may find resources claiming that this is a valid technique for imputing missing values, this could not be further from the truth. Jittering was designed for a specific problem in data visualization. It will add undue variance to your data if used in imputation.
<br><br>
<strong>Note:</strong> In most cases, data visualization methods in Python are robust in their handling of missing values. However, by not explicitly stating what to do when encountering such anomalies, we are asking Python to make assumptions behind the scenes, which may not be in the best interest of what we are trying to accomplish.

In [None]:
# Code 13.3.1

# the following code makes the new DataFrame independent
df_dropped = diamonds.copy()


# using dropna() for df_dropped
df_dropped = df_dropped.dropna().round(2)


# checking to see if all missing values have been dropped
print(df_dropped.isnull().sum())


<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

<h3>Visualizing Data in Python</h3><br>
Most visualization in Python relies on the <em>matplotlib.pyplot</em> package. We will be utilizing this as well as the <em>seaborn</em> package (&nbsp;which was built on top of <em>matplotlib.pyplot</em>&nbsp;). For our purposes, the role of <em>seaborn</em> is to produce more advanced and aesthetically-pleasing visuals. If you are a highly analytical person, focusing on visualization aesthetics may seem bizarre. After all, <font style="color:red"><strong>just like with humans, beauty is on the inside.</strong></font> Your findings from the data are what matters. If your findings are not very impactful, improving aesthetic elements will do no good in terms of addressing the problem you are trying to solve. This would be the equivalent of putting <a href="https://idioms.thefreedictionary.com/lipstick+on+a+pig">lipstick on a pig</a>. Keep in mind, however, that in a business setting, it is still somewhat common that a bad idea that looks fancy will beat a great idea that looks ugly.
<br><br>
The following code will generate a histogram for <em>carat</em> after dropping missing values. Our goal in this step is to analyze the original distribution of this feature in order to develop an imputation strategy. Also notice that the visual contains a title and axis labels. These are very important and should <strong>always</strong> be included in every visual you create.

In [None]:
# Code 13.3.2

# histogram for carat
sns.histplot(data  = df_dropped,
             x     ='carat',
             bins  = 'fd',
             kde   = False,
             color = 'black')


# this adds a title
plt.title(label = "Distribution of Carat Weight")


# this adds an x-label
plt.xlabel(xlabel = 'Carat Weight')


# this adds a y-label
plt.ylabel(ylabel = 'Frequency')


# these compile and display the plot so that it is formatted as expected
plt.tight_layout()
plt.show()


<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br>
<h3>The Optimal Number of Bins - The Freedman-Diaconis Rule</h3><br>
Freedman and Diaconis wrote <a href="http://math.sjtu.edu.cn/faculty/chengwang/files/2015fall/1.pdf">a statistical heuristic</a> to help determine the optimal number of bins in a histogram. Technically speaking, this rule tries to minimize the difference between a feature's theoretical probability distribution and the one observed from the data (i.e., the empirical probability distribution). In other words, this method tries to create an optimal number of bins to fit with the science of statistics as well as the distribution of the phenomenon represented by the data. A lot can occur when this sort of optimization is applied, and sometimes this rule does not work out as well as expected. In most cases, however, it is a good starting point to develop a reasonable visualization for exploratory analysis. If we were to program this rule ourselves, it would look similar to the following:
<br><br>

~~~
## The Freedman-Diaconis Rule ##

# importing numpy (pd.np was removed in pandas 2.0)
import numpy as np

# instantiating the third quartile (75th percentile)
q3 = np.percentile(diamonds['price'], 75)

# instantiating the first quartile (25th percentile)
q1 = np.percentile(diamonds['price'], 25)

# instantiating the interquartile range (IQR)
iqr_price = q3 - q1

# calculating the bin width
h = 2 * iqr_price * (len(diamonds['price']) ** -(1/3))

# ranging price
price_range = max(diamonds['price']) - min(diamonds['price'])

# printing the optimal number of bins
print(price_range / h)
~~~

<br>
Luckily for us, someone has already programmed this rule and shared it with the Python community. Thanks to this programmer, we can access the optimal number of bins based on the Freedman-Diaconis Rule with very simple syntax at absolutely no monetary cost. Before moving forward, let's explore what happens when we change the number of bins.
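<br><br>
When <em>bins = 'fd'</em> is passed to <em>sns.histplot(&nbsp;)</em>, the argument is handed to NumPy's <em>histogram_bin_edges(&nbsp;)</em> function. The following sketch compares a by-hand calculation with NumPy's result. The data is randomly generated for illustration (it is not the <em>diamonds</em> dataset), and rounding the number of bins up to the next whole number is how NumPy converts a bin width into a bin count.

```python
import numpy as np

# hypothetical right-skewed data (randomly generated for illustration)
rng    = np.random.default_rng(seed = 219)
values = rng.lognormal(mean = 0.0, sigma = 0.5, size = 500)

# bin width from the Freedman-Diaconis rule, calculated by hand
q1, q3    = np.percentile(values, [25, 75])
bin_width = 2 * (q3 - q1) * len(values) ** -(1/3)
n_bins    = int(np.ceil((values.max() - values.min()) / bin_width))

# the same rule as implemented in NumPy
edges = np.histogram_bin_edges(values, bins = "fd")

print(f"Bins by hand : {n_bins}")
print(f"Bins by NumPy: {len(edges) - 1}")
```

Since a histogram with 10 bins has 11 edges, the number of bins is one less than the number of edges returned by NumPy.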

<br><hr style="height:.9px;border:none;color:#333;background-color:#333;" />

In [None]:
# Code 13.3.3

#########################
## Setting Figure Size ##
#########################

# NEW! Setting figure size
fig, ax = plt.subplots(figsize = [10, 15])


###########################
## Plotting First Visual ##
###########################

# NEW! Plotting multiple visuals in the same plot area
plt.subplot(3, 1, 1) # 3 rows, 1 column, space 1


# fd bins
# histogram for carat
sns.histplot(data  = df_dropped,
             x     ='carat',
             bins  = 'fd',
             kde   = False,
             color = 'black')


# titles and axis labels
plt.title(label = "Bins: Freedman-Diaconis")
plt.xlabel(xlabel = 'Carat Weight')
plt.ylabel(ylabel = 'Frequency')


############################
## Plotting Second Visual ##
############################

# plot area 2
plt.subplot(3, 1, 2) # 3 rows, 1 column, space 2


# 15 bins
# histogram for carat
sns.histplot(data  = df_dropped,
             x     ='carat',
             bins  = 15,
             kde   = False,
             color = 'black')


# titles and axis labels
plt.title(label = "Bins: 15")
plt.xlabel(xlabel = 'Carat Weight')
plt.ylabel(ylabel = 'Frequency')


###########################
## Plotting Third Visual ##
###########################

# plot area 3
plt.subplot(3, 1, 3) # 3 rows, 1 column, space 3


# 150 bins
# histogram for carat
sns.histplot(data  = df_dropped,
             x     ='carat',
             bins  = 150,
             kde   = False,
             color = 'black')


# titles and axis labels
plt.title(label = "Bins: 150")
plt.xlabel(xlabel = 'Carat Weight')
plt.ylabel(ylabel = 'Frequency')


###############################################################
## plt.tight_layout() and plt.show() always go at the bottom ##
###############################################################

# these compile and display the plot so that it is formatted as expected
plt.tight_layout()
plt.show()


<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br>
In short, start with the Freedman-Diaconis rule to determine the optimal number of bins. If you see strange gaps in your visualization, adjust the number of bins slightly until the gaps disappear. Remember, our goal is to get a fair representation of the distribution of a given feature.
<br><br>
Note that if we were writing an analysis report, each visualization produced would need at least one paragraph explaining why it is worth presenting and the key things to look for when analyzing it. <strong><font style="color:red">Never assume another analyst will interpret a visualization in the same way that you did.</font></strong> If a visualization is not worth at least a paragraph of explanation, it should be removed. Note that this is especially important when visualizing multiple graphs in the same plot area, as in the code above.
<br><br>
<h3>Determining the "Best" Imputation Strategy</h3><br>
Let's keep things simple and focus on a basic imputation decision: choosing either the mean or the median as our fill-in value. To better portray each of these two choices, the code below includes vertical lines at the mean and median for carat weight.

In [None]:
# Code 13.3.4

# setting figure size
fig, ax = plt.subplots(figsize = [8, 5])


# histogram for carat
sns.histplot(data  = df_dropped,
             x     = 'carat',
             bins  = 'fd',
             kde   = True, # drawing kernel density estimate (KDE)
             color = 'black')


# titles and labels
plt.title(label = "Distribution of Carat Weight")
plt.xlabel(xlabel = 'Carat Weight')
plt.ylabel(ylabel = 'Frequency')


# New: These add vertical lines to the plot
plt.axvline(x     = df_dropped['carat'].mean(),
            color = 'red',
            label = 'mean')


plt.axvline(x     = df_dropped['carat'].median(),
            color = 'blue',
            label = 'median')


# this adds a legend (using the labels specified above)
plt.legend()


# these compile and display the plot so that it is formatted as expected
plt.tight_layout()
plt.show()


<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br>
<h3>Mean or Median?</h3><br>
Since the distribution for carat weight appears skewed positive (skewed to the right), the median better represents the center of this distribution. Therefore, without any further information, we should choose the median to fill in missing values.
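<br><br>
The effect of skewness on the mean can also be checked numerically. The sketch below uses a handful of made-up carat weights (not the actual dataset) with one large value in the right tail, and compares the mean and median alongside the sample skewness from <em>.skew(&nbsp;)</em>.

```python
import pandas as pd

# hypothetical right-skewed carat weights (made up for illustration)
carat = pd.Series([0.3, 0.3, 0.4, 0.5, 0.7, 1.0, 2.5])

print(f"""
Mean    : {carat.mean():.2f}
Median  : {carat.median():.2f}
Skewness: {carat.skew():.2f}
""")
```

A positive skewness value indicates a longer right tail. The single large value pulls the mean well above the median, while the median stays close to where most of the data sits.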
<br><br>
<strong>Note:</strong> If you are familiar with <a href="https://en.wikipedia.org/wiki/Data_transformation_%28statistics%29">data transformations</a>, don't worry about this for now.
<br><br>
<h3>Imputing with <em>.fillna(&nbsp;)</em></h3><br>
As its name implies, the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html">.fillna(&nbsp;)</a> method can be used to fill in the missing values of a given feature. This is exemplified in the following code.

In [None]:
# Code 13.3.5

# soft coding MEDIAN for carat
carat_median = diamonds['carat'].median()


# filling carat NAs with MEDIAN
diamonds['carat'] = diamonds['carat'].fillna(value = carat_median)


# checking to make sure NAs are filled in
print(f"""
{'_' * 40}

Any missing values for carat?
{'_' * 40}

{diamonds['carat'].isnull().any()}
""")


<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

Let's take a look at what happened to the missing values for carat.

<h4>Practice - Subset the <em>diamonds</em> DataFrame to display <em>m_carat</em> and <em>carat</em> where <em>m_carat</em> is equal to one.</h4>

In [None]:
# Code 13.3.6

# subsetting original missing values for carat
_____


In [None]:
# Sample Solution 13.3.6

# subsetting original missing values for carat
diamonds.loc[ diamonds.loc[ : , 'm_carat'] == 1 , ['m_carat', 'carat'] ]


<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br>
As can be observed, the missing values for <em>carat</em> have been imputed with the median for this feature. Let's overlay the original and imputed distributions for carat weight. This should give us a good indication as to how closely the imputed distribution resembles the original distribution.

In [None]:
# Code 13.3.7

# setting figure size
fig, ax = plt.subplots(figsize = [8, 5],
                       sharex = True, # sharing x-axis between visualizations
                       sharey = True) # sharing y-axis between visualizations


# histogram for carat
sns.histplot(data  = df_dropped,
             x     = 'carat',
             bins  = 21,
             kde   = True, # drawing kernel density estimate (KDE)
             color = 'red')


# histogram for carat
sns.histplot(data  = diamonds,
             x     = 'carat',
             bins  = 21,
             kde   = True, # drawing kernel density estimate (KDE)
             color = 'black')


# titles, labels, and formatting
plt.title(label   = "Distribution of Carat Weight")
plt.xlabel(xlabel = 'Carat Weight')
plt.ylabel(ylabel = 'Frequency')
plt.xlim(0.0, 2.75) # setting x-axis range
plt.ylim(0.0, 100) # setting y-axis range


# this adds a legend
plt.legend(labels =  ['original distribution',
                      'imputed distribution'])


# this compiles the plot so that it is formatted as expected
plt.tight_layout()


# NEW! Saving a figure as an image (must come before plt.show())
plt.savefig(fname = './__images/Imputation of Carat.png')


# this displays the plot
plt.show()


<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

As can be observed from above, the original and imputed distributions look very similar. Also note that although we imputed with the median, other parts of the overlaid histograms are not fully aligned. This is normal when working with <em>matplotlib.pyplot</em> and <em>seaborn</em>. Keep in mind that <em>df_dropped</em> excludes every observation with a missing value in any feature, so it contains fewer observations than <em>diamonds</em>. It also reemphasizes the need for at least one paragraph of explanation per visual that you present.
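<br><br>
A numeric comparison can complement the visual one. The following sketch uses a small, made-up Series (not the <em>diamonds</em> dataset) to compare summary statistics before and after median imputation.

```python
import numpy as np
import pandas as pd

# hypothetical feature with missing values (made up for illustration)
original = pd.Series([0.3, 0.5, np.nan, 0.7, 1.0, np.nan, 2.5])

# imputing with the median of the observed values
imputed = original.fillna(value = original.median())

# comparing summary statistics (missing values are skipped automatically)
comparison = pd.DataFrame({"original" : original.describe(),
                           "imputed"  : imputed.describe()})

print(comparison.round(2))
```

In this example, the median is unchanged, but the standard deviation shrinks because every imputed value sits exactly at the center of the distribution. This is one reason to keep the missing value flags: they preserve a record of which values were imputed.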
<br><br>
<h3>Summary</h3><br>
This chapter presented missing value anomalies and fundamental techniques to address them. When working with missing values, keep in mind that there may be an insightful rationale as to why such anomalies are present. This is where art meets science and analysts are able to create tremendous value. As a next step, take a few minutes to analyze and impute the missing values of other features in the dataset. The next chapter will utilize a version of the dataset where all missing values have been imputed.

~~~


                                                  ,---,  
                                               ,`--.' |  
           .---.                        ___    |   :  :  
          /. ./|                      ,--.'|_  '   '  ;  
      .--'.  ' ;   ,---.     ,---.    |  | :,' |   |  |  
     /__./ \ : |  '   ,'\   '   ,'\   :  : ' : '   :  ;  
 .--'.  '   \' . /   /   | /   /   |.;__,'  /  |   |  '  
/___/ \ |    ' '.   ; ,. :.   ; ,. :|  |   |   '   :  |  
;   \  \;      :'   | |: :'   | |: ::__,'| :   ;   |  ;  
 \   ;  `      |'   | .; :'   | .; :  '  : |__ `---'. |  
  .   \    .\  ;|   :    ||   :    |  |  | '.'| `--..`;  
   \   \   ' \ | \   \  /  \   \  /   ;  :    ;.--,_     
    :   '  |--"   `----'    `----'    |  ,   / |    |`.  
     \   \ ;                           ---`-'  `-- -`, ; 
      '---"                                      '---`"  
                                                         


~~~

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
<h3>Bonus: Tip for Improving Visual Output</h3><br>
The following template is meant to assist you in developing more aesthetically-pleasing data visualizations.
<br><br>
<strong>Titles and Axis Labels</strong> - Use triple-quotes to add more information throughout a visual.

~~~
plt.title(label = """
TITLE
SUBTITLE
""")
~~~


<br><br><hr style="height:.9px;border:none;color:#333;background-color:#333;" />

<br>