<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br><br><br>
<h1>Python for Business Analytics</h1>
<em>A Nontechnical Approach for Nontechnical People</em><br><br>
<em><strong>Custom Edition for Hult International Business School</strong></em><br>

Written by Chase Kusterer - Faculty of Analytics <br>
Hult International Business School <br>
<a href="https://github.com/chase-kusterer">https://github.com/chase-kusterer</a>
<br><br><br><br><br>
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br>
<h1>Chapter 11: Introduction to DataFrames</h1>

<h2>11.1 Domain Knowledge Gathering in Business Analytics</h2><br>
Previous chapters were designed to explain critical programming concepts in Python. Before moving forward, it is important to note that concepts such as lists, conditional statements, loops, and user-defined functions exist in virtually every modern programming language, making the knowledge attained up to this point transferable across a variety of tasks involving code. Additionally, now that you have an understanding of the fundamentals, it is feasible to branch out into any number of specializations using Python. For example, if you are interested in another field of programming such as web development, you are now able to pursue such an endeavor. In fact, at the time of this writing, <a href="https://www.netsolutions.com/insights/top-10-python-frameworks-for-web-development-in-2019/">Google, Netflix, and Instagram are all using Python for their web development efforts</a>.
<br><br>
As you (hopefully) expected, the remaining chapters of this book will focus on applying Python to business analytics. <strong>Business analytics is a deep subject</strong>, and a number of analytical philosophies have emerged to try to keep pace with its rapid expansion. The analytical philosophy of this book is centered around a very simple concept:
<br><br>
<div align="center"><strong>
    Business analytics is about applying information to solve problems.
    </strong><a class="tocSkip"></a></div><br>
It is my hope that you feel this is one of the most intuitive concepts ever written. In the opinion of your author, however, an incredible number of analytical solutions fail to adequately focus on solving their respective problem, let alone match the performance of approaches based on experience and/or common sense. This failure is largely due to analysts sitting too far away from the problem they are trying to solve, many with little to no interest in doing anything more than rushing to build a predictive model. Such analysts lack the discipline required to avoid cutting corners, especially in terms of acquiring enough <strong>domain knowledge</strong> to develop a proper solution. Instead, they fall deep into a model metrics rabbit hole, losing sight of everything except for a model's predictive performance.<br><br>
Although this chapter is intended to introduce critical aspects of DataFrames, it is important to keep in mind that when working on an analytical project in a business setting, domain knowledge gathering is a continual process that should not be neglected. As each new DataFrame method is introduced throughout this chapter, take a moment to consider how it can be utilized in tandem with domain knowledge in order to develop an analytical solution.
<br><br>

<h3>The Two Month's Salary Challenge</h3><br>
The content of this chapter is reliant on a simple yet elegant dataset. This dataset has been published in <a href="https://www.amazon.com/Marketing-Data-Science-Techniques-Predictive/dp/0133886557">Marketing Data Science - Modeling Techniques in Predictive Analytics with R and Python</a>, written by Dr. Thomas Miller of Northwestern University. As such, it is fitting to allow Dr. Miller to introduce this dataset, as well as the two month's salary challenge:
<br>
<p><em>I never understood why giving a diamond was the social norm when proposing
marriage. As I began searching for an engagement ring, two thoughts
kept racing through my mind: “How will I be able to find the right diamond?”
and “What is this thing going to cost me?” It goes without saying
that my fiancée-to-be is worth the expense, but very seldom in our lives do
    we spend two month’s salary on a product we know so little about. </em><a href="./__documents/miller_mds_two_months_salary_case.pdf">Click here to continue reading</a>. Note that if your system's default browser is blocking this link, a .pdf of this documentation can be found in the documents folder of this textbook.</p>

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
<h2>11.2 Importing Excel Data into Python</h2><br>
If you decided to skip reading the documentation above, go back and do so now. Remember, business analytics is about applying information to solve problems. Solving problems is incredibly challenging without domain knowledge, which encompasses documentation on datasets. A common entry-level mistake is to skip such knowledge gathering and jump right into the data. Such an approach is equivalent to putting on a blindfold and attempting to drive a car: you are actively choosing to ignore important information, making you less capable in what you are trying to achieve.
<br><br>
The <em>pandas</em> package has an excellent method to help us import Excel-style data into Python. This method, <strong>read_excel(&nbsp;)</strong>, has a number of optional arguments designed to make our lives easier. To begin, let's import <em>pandas</em> as <em>pd</em> and then read in the dataset with <strong>pd.read_excel(&nbsp;)</strong>.
<br><br>
<strong>Note:</strong> The <em>pandas</em> package offers several other methods to read data into Python. For example, <strong>read_csv(&nbsp;)</strong> can be utilized to import .csv files. More information on this can be found in <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html#user-guide">the official User Guide for pandas</a>.
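<br><br>
To make the note above more concrete, below is a minimal sketch of <strong>read_csv(&nbsp;)</strong>. The data is made up for illustration (it is not the diamonds dataset), and it is stored as text inside the code so that the example runs without any files on your system.

```python
# a minimal sketch of read_csv(); the data below is hypothetical
from io import StringIO

import pandas as pd

# made-up .csv content (normally this would live in a file on disk)
csv_text = """carat,color,price
0.31,4,880
0.52,6,1450
1.01,3,5210
"""

# StringIO lets pandas treat the text above as if it were a file
sample = pd.read_csv(filepath_or_buffer = StringIO(csv_text), # data to be read in
                     header             = 0)                  # row to find column names

# checking the result
print(sample.head(n = 3))
```

In practice, the first argument would usually be a path such as the one stored in <em>file</em> in the code that follows.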

In [None]:
## Code 11.2.1 ##

# importing pandas
import pandas as pd

# storing the path to the dataset
file = "./__datasets/diamonds.xlsx"

# reading in the data
pd.read_excel(io = file)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
As can be observed, <em>Code 11.2.1</em> contains three steps:
<br><br>
<font>
1. Import <em>pandas</em>.<br>
2. Specify the location of the data.*<br>
3. Use read_excel(&nbsp;) to read in the data.</font>

<font><em>* The file is located in a folder named <em>__datasets</em>. If the data were in the same folder as this chapter, only the file name and the <em>.xlsx</em> extension would be required for import.</em></font>

<br>
pd.read_excel(&nbsp;) takes one mandatory argument: the file we would like to import (<em>io</em>). The object it returns is known as a <strong>DataFrame</strong>. For now, think of a <strong>DataFrame</strong> as a well-organized Excel spreadsheet, where each row represents an observation, and each column represents a <strong>feature</strong> (i.e., a characteristic of an observation). In our dataset, each row represents a diamond engagement ring, and each column represents a characteristic of a diamond engagement ring (carat weight, color, etc.). The table below has been designed to familiarize you with the flexibility of pd.read_excel() through some of its commonly used arguments. If you haven't already, I strongly suggest you read <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html?highlight=read_excel#pandas.read_excel">this method's documentation</a>.<br><br>

<div style = "width:image width px; font-size:80%; text-align:center;">
<table align="center">
<col width="100">
<col width="10">   
<col width="600">
    <tr>
        <th>Argument</th>
        <th>    | </th>
        <th>Description</th>
    </tr>
    <tr>
        <td>io</td>
        <td>    | </td>
        <td> the file, path, or URL of the data</td>
    </tr>
    <tr>
        <td>sheet_name</td>
        <td>    | </td>
        <td> if your data exists on multiple sheets in Excel, this is the argument to tell Python which sheet to read</td>
    </tr>
    <tr>
        <td>header</td>
        <td>    | </td>
        <td> if your Excel file has column names in the first row, this is the argument to tell this to Python</td>
    </tr>   
    <tr>
        <td>dtype</td>
        <td>    | </td>
        <td> Excel and Python tend to interpret data types differently. This argument helps you control this.</td>
    </tr>
    
          Table 11.1: Extremely useful arguments for pd.read_excel().
</table></div><br>

Although there will be no difference in our current results, <strong><font style="color:red">it is a good practice to explicitly specify optional arguments that we may want to override, even if we are currently using their default values.</font></strong> This is especially important at this stage in your Python journey as it will also help you get more familiar with what each method is capable of. To exemplify, this has been done in <em>Code 11.2.2</em> with the arguments above (with the exception of dtype, which will be explained later).

In [None]:
## Code 11.2.2 ##

# instantiating the dataset as an object
diamonds = pd.read_excel(io         = file, # file to be read in
                         sheet_name = 0,    # could also be 'diamonds'
                         header     = 0)    # row to find column names


# checking the first five rows of the dataset
diamonds.head(n = 5)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
The <strong>head(&nbsp;)</strong> method in <em>Code 11.2.2</em> displays the first <em>n</em> rows of a dataset. This is important, as in general:<br><br><br>

<div align="center"><strong>
    Attempting to view an entire dataset is a bad idea.
    </strong><a class="tocSkip"></a></div><br>

There is no value in outputting an entire dataset, and doing so would be the equivalent of printing out an entire Excel file on paper and trying to make sense of the information within. I have yet to see a case where this would make sense for an analysis. This is such a bad idea that Python restricts such attempts by default. To illustrate, please see Figure 11.1 below.

Behind the scenes, <em>pandas</em> reached its printing limit, which protects your kernel from having to process something you are very unlikely to use. Also note that our current dataset of 409 rows and 8 columns is quite small compared to what you may experience in the real world. Imagine how much fun it would be to wait for billions of rows  to render (i.e., display).<br><br>
When suppressing output, <em>pandas</em> defaulted to rendering the first five rows (the default for <strong>head(&nbsp;)</strong>) and the last five rows (the default for <strong>tail(&nbsp;)</strong>, the counterpart of head(&nbsp;)). In practice, this is generally enough to get a quick feel for the data before moving forward.
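<br><br>
The defaults described above can be checked with a short sketch. The dataset below is hypothetical (twelve made-up rows), and is only meant to show what <strong>head(&nbsp;)</strong> and <strong>tail(&nbsp;)</strong> return when no value for <em>n</em> is provided.

```python
import pandas as pd

# hypothetical dataset with 12 rows (not the diamonds data)
toy = pd.DataFrame(data = {'row_id' : range(12),
                           'value'  : [x * 10 for x in range(12)]})

# no value for n is provided, so the default of 5 is used
first_rows = toy.head()
last_rows  = toy.tail()

# first five rows
print(first_rows)

# last five rows
print(last_rows)
```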
<br><br>

<p style="padding: 10px; border: 2px solid red;">
<strong><u>Analytical Value of head(&nbsp;)</u><br>
1. Make sure the data imported as expected.<br>
2. Get a brief glance at the quality of the data.</strong>
</p>

<br>
<strong>Note:</strong> There is an option to change the printing limit in <em>pandas</em>. However, this will not be presented in this chapter for the reasons stated above.

<br><br>
<div style = "width:image width px; font-size:80%; text-align:center;"><img src="./__images/chapter-11-suppressed-output.png" width="5000" height="5000" style="padding-bottom:0.5em;"> <em>Figure 11.1: Python's reaction to excessive printing.</em></div>

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
<h2>11.3 Exporting a DataFrame to Excel</h2>

DataFrames can also be exported to Excel with the use of <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html">to_excel(&nbsp;)</a>. As you may have imagined, this method has several optional arguments that can be found in its documentation. Additionally, although its use is straightforward, <strong>.to_excel(&nbsp;)</strong> can have unexpected consequences if we are not careful. To demonstrate, the code below is designed to:

1. Export the diamonds dataset to an Excel file.
2. Show a list of all of the files in the datasets folder (including the one that was exported).
3. Reimport and display the head(&nbsp;) of the data, with an unintended extra column.

In [None]:
## Code 11.3.1(a) ##

# storing the path to the dataset
file = "./__datasets/EXPORTED_DATASET.xlsx"

# saving diamonds as a new Excel file in the datasets directory (i.e. folder)
diamonds.to_excel(excel_writer = file)

# This code will not produce an output

<br>

In [None]:
## Code 11.3.1(b) ##

# importing package to list all files in a directory
from os import listdir

# printing all files in the datasets directory
print(listdir(path="./__datasets/"))

<br>

In [None]:
## Code 11.3.1(c) ##

# importing the file that was just created
export_practice = pd.read_excel(io = file)

# checking the first n rows of the file
export_practice.head(n = 5)

<br>
Notice that the new DataFrame in <em>Code 11.3.1(c)</em> has one additional column labeled <em>Unnamed: 0</em>. This column represents the index values (i.e., row numbers) that were present when the file was exported from Python (<em>Code 11.3.1(a)</em>). By default, every time we call <strong>to_excel(&nbsp;)</strong>, it will create a new column containing a dataset's index values, even if one already exists. This behavior is valuable in several situations, such as when a dataset has been divided up into various parts so that it can be processed in multiple places (which is essentially what happens in <a href="https://www.dictionary.com/browse/cloud-computing">cloud computing</a>). However, if we were using <strong>to_excel(&nbsp;)</strong> for another purpose, such as creating a new file after cleaning up a dataset, the index column is not necessary. <em>Code 11.3.2</em> overrides the <em>index</em> argument of <strong>to_excel(&nbsp;)</strong> so that the columns of the dataset remain consistent with the original file.
<br><br>

In [None]:
## Code 11.3.2 ##

# storing the path to the dataset
file = "./__datasets/EXPORTED_DATASET.xlsx"

# saving diamonds as a new Excel file in the datasets directory (i.e. folder)
diamonds.to_excel(excel_writer = file, # path to the dataset
                  index = False)       # removing index column

# importing the file that was just created
export_practice = pd.read_excel(io = file)


# checking the first n rows of the file
export_practice.head(n = 5)
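
<br>
The same index behavior can be reproduced without writing anything to disk. The sketch below is an illustration rather than part of the diamonds workflow: it assumes a made-up two-row dataset and uses <strong>to_csv(&nbsp;)</strong> with in-memory text instead of <strong>to_excel(&nbsp;)</strong>, since both share the <em>index</em> argument.

```python
from io import StringIO

import pandas as pd

# hypothetical two-row dataset
toy = pd.DataFrame(data = {'carat' : [0.31, 0.52],
                           'price' : [880, 1450]})

# exporting WITH the index (the default behavior)
with_index = StringIO()
toy.to_csv(path_or_buf = with_index)
with_index.seek(0) # moving back to the start before reading

# exporting WITHOUT the index
no_index = StringIO()
toy.to_csv(path_or_buf = no_index,
           index       = False)
no_index.seek(0)

# comparing the columns after reimporting
cols_with_index = pd.read_csv(with_index).columns.tolist()
cols_no_index   = pd.read_csv(no_index).columns.tolist()

print(cols_with_index) # includes the extra 'Unnamed: 0' column
print(cols_no_index)   # matches the original columns
```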

<br><br>
<p style="padding: 10px; border: 2px solid red;">
<strong><u>Analytical Value of to_excel(&nbsp;)</u><br>
1. Save a copy of a dataset at various stages of an analysis.<br>
2. Develop datasets for Excel users.</strong>
</p>

<br><br>
<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
<h2>11.4 The Anatomy of a DataFrame</h2><br>
Now that the diamonds DataFrame has been instantiated (i.e., loaded into Python's memory), we are able to access and explore it in a fashion similar to lists. However, performing such operations on DataFrames is slightly more complicated as, unlike lists, DataFrames are <strong>multidimensional</strong>. In other words, lists have one dimension (rows), whereas DataFrames have two (rows and columns).
<br><br>
For now, let's think of each row of a DataFrame as a list. Also, recall from <strong>Chapter 5: Lists and List Operations</strong> that the elements of a list can be of different data types. This is also true of the rows of a DataFrame. To illustrate further, let's analyze the data types of the first five rows of the dataset.

<br>
<div style = "width:image width px; font-size:80%; text-align:center;"><img src="./__images/chapter-11-diamonds_head_output.png" width="400" height="400" style="padding-bottom:0.5em;"> <em>Figure 11.2: First Five Rows of the diamonds DataFrame</em></div>

<br>
The data types of each element outputted above are as follows*:
<br><br>

~~~
['int', 'float', 'float', 'float', 'float', 'int', 'int', 'int'],
['int', 'float', 'float', 'float', 'float', 'int', 'int', 'int'],
['int', 'float', 'float', 'float', 'float', 'int', 'int', 'int'],
['int', 'float', 'float', 'float', 'float', 'int', 'int', 'int'],
['int', 'float', 'float', 'float', 'float', 'int', 'int', 'int']
~~~

<br>
<font><em>* Methods for analyzing data types will be covered in a later chapter.</em></font>
<br><br><br>
Once again, DataFrames allow row-wise elements to be of different data types. However, data types for the elements of each column must be the same. This is important to remember, as Python follows a <a href="https://docs.python.org/3/reference/datamodel.html#the-standard-type-hierarchy">built-in hierarchy</a> when column-wise data types are inconsistent. For example, if a string were present in the <em>carat</em> column, <em>pandas</em> would stop treating the column as numeric and instead store the entire column under a general-purpose data type (<em>object</em>), even if every other element was a float. At this point in your Python journey, understanding the built-in hierarchy in detail is not necessary. For now, just know that when a column's data type is different than expected, it is likely that the column in question contains anomalies (unexpected strings, missing values, etc.). Such anomalies will be addressed in a later chapter.
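<br><br>
As a small preview, the sketch below shows how a single unexpected string changes the data type reported for an entire column. The dataset and the column name <em>carat_mixed</em> are made up for illustration, and the <em>dtypes</em> attribute used here is only one of the tools for analyzing data types.

```python
import pandas as pd

# hypothetical dataset: one clean column and one column with an anomaly
toy = pd.DataFrame(data = {'carat'       : [0.31, 0.52, 1.01],
                           'carat_mixed' : [0.31, 'unknown', 1.01]})

# checking the data type of each column
print(toy.dtypes)
```

The clean column is reported as <em>float64</em>, while the column containing the string is reported as <em>object</em>, which is a good hint that something unexpected is hiding inside.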
<br><br>
<h3>Series Objects</h3>

A single column extracted from a DataFrame is known as a <a href="https://pandas.pydata.org/pandas-docs/stable/reference/series.html">Series object</a>. The following code snippet exhibits how to extract a <strong>Series</strong> from a DataFrame.
<br><br>

~~~
# subsetting a series column
DataFrame['column name']
~~~

<br>
As can be observed, subsetting a <strong>Series</strong> is very similar to indexing a list. Note that column labels are treated as strings, thus the column being extracted must be encompassed in quotation marks.
<br>

On the technical side, since <strong>Series</strong> objects represent one column of data and each element of a column needs to be of the same type, Python processes them more efficiently than DataFrames. This comes with a minor drawback for users of Jupyter Notebook and most other <a href="https://en.wikipedia.org/wiki/Integrated_development_environment">Integrated Development Environments</a>: The native output for <strong>Series</strong> objects is less easy for humans to read. This can be observed from <em>Code 11.4.1</em> below.
<br><br>
Since we extracted a single column of data, Python made a type conversion behind the scenes (from DataFrame to Series). This conversion can be avoided by providing Python with a list instead of a single column name. Python will interpret this differently and return a DataFrame instead of a Series, even if only one column name is referenced within the list. This has been done in <em>Code 11.4.2</em>.
<br><br>
<em>Code 11.4.3</em> and <em>Code 11.4.4</em> further elaborate on this concept by outputting each object's type.
<br><br><br><br>

<br>

In [None]:
## Code 11.4.1 ##

# carat column
diamonds['carat'].head(n = 5)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

~~~
# slicing multiple columns
DataFrame[ [list of columns] ]
~~~

<br><hr style="height:.9px;border:none;color:#333;background-color:#333;" />

In [None]:
## Code 11.4.2 ##

# providing a list
diamonds[  ['carat']  ].head(n = 5)

<br>

In [None]:
## Code 11.4.3 ##

# checking the type of this object
type(diamonds['carat'])

In [None]:
## Code 11.4.4 ##

# note that the spacing is optional
type(diamonds[    ['carat']    ])

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>
<h2>11.5 Subsetting DataFrames</h2><br>

The logic of subsetting a Series as a DataFrame can be extended to include multiple columns: simply provide more column names inside the list as shown in <em>Code 11.5.1</em>. Generally, however, <strong><em>pandas</em> prefers we use a different logic when subsetting so that we avoid problems down the road</strong>. These methods, <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html">.loc[&nbsp;]</a> and <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html#pandas-dataframe-iloc">.iloc[&nbsp;]</a>, are incredibly important for several reasons, as explained in <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html">the pandas documentation on indexing and selecting data</a>. In short, these methods were designed to stably operate on DataFrames. For now, just remember that when subsetting DataFrames, it's best to use <strong>.loc[&nbsp;]</strong> and <strong>.iloc[&nbsp;]</strong> until you are advanced enough to have a compelling reason not to.
<br><br>
<p style="padding: 10px; border: 2px solid red;">
<strong><u>Analytical Value of .loc[&nbsp;] and .iloc[&nbsp;]</u><br>
1. Stably index DataFrames.<br>
</strong>
</p>

<br>

In [None]:
## Code 11.5.1 ##

# carat and price columns
diamonds[  ['carat', 'price']  ]\
.head(n = 5)

<hr style="height:.9px;border:none;color:#333;background-color:#333;" /><br>

~~~
# using loc[]
DataFrame.loc[row NAMES, column NAMES]

# using iloc[]
DataFrame.iloc[row NUMBERS, column NUMBERS]
~~~

<br><hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br>
Simply stated, the difference between these two methods is that <strong><em>.loc[&nbsp;]</em> works with names and <em>.iloc[&nbsp;]</em> works with numbers</strong>. The <em>.loc[&nbsp;]</em> method also accepts index numbers if a DataFrame does not have row names. Also note that the logic in the snippet above can be extended by providing lists of row/column values.<br>
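
<br>
Since the diamonds data does not have row names, the difference between names and numbers can be easier to see on a dataset that does. The sketch below assumes a made-up DataFrame whose rows are labeled <em>ring_a</em>, <em>ring_b</em>, and <em>ring_c</em>.

```python
import pandas as pd

# hypothetical dataset with row names
rings = pd.DataFrame(data  = {'carat' : [0.31, 0.52, 1.01],
                              'price' : [880, 1450, 5210]},
                     index = ['ring_a', 'ring_b', 'ring_c'])

# .loc[] works with names
by_name = rings.loc['ring_b', 'price']

# .iloc[] works with numbers (second row, second column)
by_number = rings.iloc[1, 1]

# both approaches point to the same element
print(by_name, by_number)

# extending the logic with lists of row/column values
by_list = rings.loc[ ['ring_a', 'ring_c'] , ['price'] ]
print(by_list)
```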

<h3>Subsetting with <em>.loc[&nbsp;]</em> and <em>.iloc[&nbsp;]</em></h3><br>
The following two codes will output the same result.

In [None]:
## Code 11.5.2 ##

# slicing using .loc[]
diamonds.loc[ : , 'color']

In [None]:
## Code 11.5.3 ##

# slicing using .iloc[]
diamonds.iloc[ : , 2]

<br>
That being stated, you may be wondering why Python has two logics to accomplish the same task. The rationale behind this is identical to having the ability to use multiple wrapper types in print(&nbsp;) statements: Sometimes it makes more sense to use one method over the other. Such situations will become apparent as you gain more experience with DataFrames, and below are key advantages of each method to jump start your thinking.
<br><br><br>

<hr style="height:.3px;border:none;color:#333;background-color:#333;" /><br>

<strong><u>Key Advantages of .loc[&nbsp;]</u></strong>

* DataFrames are <a href="https://www.merriam-webster.com/dictionary/mutable">mutable</a> and the positions of rows/columns can change. Referencing DataFrame elements using labels helps ensure results are consistent when a DataFrame is expected to change shape or order.
* Labels tend to make it easier to follow your own work. For example, it is much easier to remember that you are operating on <em>carat</em> using <em>diamonds.loc[ : , 'carat' ]</em> than when using <em>diamonds.iloc[ : , 1 ]</em>.

<br><strong><u>Key Advantages of .iloc[&nbsp;]</u></strong>

* All column names need to be explicitly referenced when slicing with labels. This task can be much more efficient when referencing numbers with  <em>.iloc[&nbsp;]</em>.
* It is generally easier to loop over rows/columns using numbers as opposed to labels.
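
<br>
The advantages listed above can be sketched on a small, made-up dataset. In the code below, the columns are reordered to show that labels keep pointing to the same data while positions do not, and a loop is then run over column numbers. The variable names are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# hypothetical dataset
rings = pd.DataFrame(data = {'carat' : [0.31, 0.52],
                             'color' : [4, 6],
                             'price' : [880, 1450]})

# reordering the columns so that price comes first
reordered = rings.loc[ : , ['price', 'carat', 'color'] ]

# labels return the same column before and after reordering
label_before = rings.loc[ : , 'carat'].tolist()
label_after  = reordered.loc[ : , 'carat'].tolist()

# positions return different columns before and after reordering
position_before = rings.iloc[ : , 0].tolist()     # carat
position_after  = reordered.iloc[ : , 0].tolist() # price

print(label_before, label_after)
print(position_before, position_after)

# looping over columns by number (shape[1] is the number of columns)
for col_number in range(reordered.shape[1]):
    print(col_number, reordered.iloc[ : , col_number].tolist())
```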

<br><hr style="height:.3px;border:none;color:#333;background-color:#333;">
<br><br>
Note the colons ( : ) in <em>Codes 11.5.2</em> and <em>11.5.3</em> above. Just as with slicing lists, this syntax represents a range. By itself, a colon represents all data (i.e., all rows or columns). A number on the left side of a colon represents a starting point for the range. Likewise, a number on the right side of a colon represents the end of a range. This can get a little tricky, as <strong><em>.loc[&nbsp;]</em> is inclusive of a range's endpoint while <em>.iloc[&nbsp;]</em> is exclusive</strong>.
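<br><br>
The endpoint rule above can be verified with a quick sketch. The dataset below is hypothetical (twenty made-up rows with default row numbers), and the same range is requested from both methods so that the results can be compared.

```python
import pandas as pd

# hypothetical dataset with 20 rows (row numbers 0 through 19)
toy = pd.DataFrame(data = {'value' : range(100, 120)})

# .loc[] includes the endpoint (rows 5, 6, 7, 8, and 9)
loc_slice = toy.loc[ 5:9 , 'value' ]

# .iloc[] excludes the endpoint (rows 5, 6, 7, and 8)
iloc_slice = toy.iloc[ 5:9 , 0 ]

# extending the .iloc[] endpoint by one to match .loc[]
iloc_match = toy.iloc[ 5:10 , 0 ]

print(len(loc_slice))                # 5
print(len(iloc_slice))               # 4
print(loc_slice.equals(iloc_match))  # True
```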
<br><br>
Once again, DataFrames and lists have several similarities, and your efforts in <strong>Chapter 5: Lists and List Operations</strong> will pay dividends as we move forward. Below are more examples of slicing using <em>.loc[&nbsp;]</em> and <em>.iloc[&nbsp;]</em>. Pay close attention to the subtle differences in using each method.

In [None]:
## Code 11.5.4 ##

# slicing rows/columns with .loc[]
diamonds.loc[ 5:9 , 'color' ]

In [None]:
## Code 11.5.5 ##

# slicing rows/columns with .iloc[]
diamonds.iloc[ 5:10 , 2 ]

<hr style="height:.3px;border:none;color:#333;background-color:#333;">

In [None]:
## Code 11.5.6 ##

# slicing rows/columns with .loc[]
diamonds.loc[ :4 , ['carat',
                    'color',
                    'clarity',
                    'cut'] ]

In [None]:
## Code 11.5.7 ##

# slicing rows/columns with .iloc[]
diamonds.iloc[ :5 , 1:5 ]




<br><hr style="height:.3px;border:none;color:#333;background-color:#333;"><br>
<h2>11.6 Subsetting Exercises</h2><br>
Complete the code in the following exercises by reproducing the given output using <em>.loc[&nbsp;]</em> or <em>.iloc[&nbsp;]</em>.

In [None]:
## Code 11.6.1(a) ##

# .loc[]
diamonds.loc[ 10:15 , ['carat',
                       'cut',
                       'price'] ]

In [None]:
## Code 11.6.1(b) ##

# .iloc[]
____



In [None]:
## Sample Solution 11.6.1 ##

# .iloc[] (columns: 1 = carat, 4 = cut, 7 = price)
diamonds.iloc[ 10:16 , [1, 4, 7] ]

<br><hr style="height:.3px;border:none;color:#333;background-color:#333;"><br>

In [None]:
## Code 11.6.2(a) ##

# .iloc[]
diamonds.iloc[ 400: , 2:4 ]

In [None]:
## Code 11.6.2(b) ##

# .loc[]
____

In [None]:
## Sample Solution 11.6.2 ##

# .loc[]
diamonds.loc[ 400: , ['color', 'clarity']]

<br><hr style="height:.3px;border:none;color:#333;background-color:#333;"><br>

In [None]:
## Code 11.6.3(a) ##

# .iloc[]
diamonds.iloc[ 3:9 , [4, 6, 7] ]


In [None]:
## Code 11.6.3(b) ##

# .loc[]
___


In [None]:
## Sample Solution 11.6.3 ##

# .loc[]
diamonds.loc[ 3:8 , ['cut',
                     'store',
                     'price'] ]

<br><hr style="height:.3px;border:none;color:#333;background-color:#333;"><br>

In [None]:
## Code 11.6.4(a) ##

# .loc[]
diamonds.loc[ 10:15 , ['carat',
                       'cut',
                       'price'] ]

In [None]:
## Code 11.6.4(b) ##

# .iloc[]
_____



In [None]:
## Sample Solution 11.6.4 ##

# .iloc[]
diamonds.iloc[ 10:16 , [1, 4, 7] ]

<br><hr style="height:.3px;border:none;color:#333;background-color:#333;"><br>

<h2>11.7 Conditional Subsetting</h2><br>
<strong>Conditional subsetting</strong>, or filtering a DataFrame based on a set of conditions, can be accomplished using square brackets. At first, this can be a bit confusing, as filters are often used in tandem with lists, which also use square brackets. A good approach is to remember that Python ignores extra spaces inside of brackets, and use this to your advantage.
<br><br>
The following snippet outlines the format of conditional subsetting on DataFrames:<br><br>

~~~
DataFrame.loc[ rows, columns ][ condition ]
~~~

<br><br>
This format can be extended to as many conditions as needed:<br><br>

~~~
DataFrame.loc[ rows, columns ][ condition 1 ][ condition 2 ][ condition n ]
~~~


<br><br>For example, if we were interested in diamond engagement rings priced at $1,000 or less, we could conditionally subset the DataFrame as shown below. Also note that the use of <em>.loc[&nbsp;]</em> in the code below is optional, but a best practice (syntax such as <em>diamonds[ condition ]</em> is also valid for conditional subsetting).

In [None]:
## Code 11.7.1 ##

diamonds.loc[ : , : ][diamonds.loc[ : ,'price'] <= 1000]
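
<br>
To see what is happening inside the second set of brackets, it can help to output the condition on its own. The sketch below assumes a made-up dataset of four rings: the condition evaluates to one True/False value per row, and only the rows marked True are kept.

```python
import pandas as pd

# hypothetical dataset
rings = pd.DataFrame(data = {'carat' : [0.31, 0.52, 1.01, 0.75],
                             'price' : [880, 1450, 5210, 990]})

# the condition on its own (one True/False value per row)
condition = rings.loc[ : , 'price'] <= 1000
print(condition)

# applying the condition keeps only the rows marked True
affordable = rings.loc[ : , : ][condition]
print(affordable)
```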

<hr style="height:.3px;border:none;color:#333;background-color:#333;"><br>
Now let's add some flavor to our subsetting by relating diamond engagement ring prices to per capita income. Recall from <a href="./__documents/miller_mds_two_months_salary_case.pdf">the documentation</a> that the diamond data was collected in the city of Chicago in 2007. Per capita income for the city of Chicago is available on <a href="https://www.census.gov/">the U.S. Census Bureau website</a>, and the most relevant income data we can gather within arm's reach was collected from 2014-2018. Although this does not perfectly align with the data collection period, it is a reasonable place to start, and the U.S. Census Bureau is a globally respected source. Also remember that if we soft code, we can update the per capita income as new information becomes available.
<br><br>
<p>From 2014-2018, <a href="https://www.census.gov/quickfacts/fact/table/chicagocityillinois/LND110210">the per capita income in the city of Chicago</a> was \&#36;34,775. Our case is titled "Two Month's Salary", leaving a "normal" Chicago citizen with a budget of approximately \&#36;5,795.83 (two twelfths of the annual figure) to spend on an engagement ring. Since the data in the dataset does not include decimal places, let's round this down to \&#36;5,795 so as not to go over budget.
<br><br>    
A little side note before we get started: At one point in my life, the person I was dating tried to convince me that 2 month's salary meant two months potential salary. For our purposes, we will assume this is not how the engagement ring heuristic is meant to be interpreted (but it's always great to norm on expectations!).</p>
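
<br>
The budget above can be soft coded so that it is easy to update when new income data becomes available. Below is a minimal sketch of this arithmetic; the variable names are my own choices for illustration.

```python
import math

# per capita income for the city of Chicago (2014-2018)
per_capita_income = 34775

# converting annual income into two months of salary
months_in_year   = 12
months_of_salary = 2
raw_budget = per_capita_income / months_in_year * months_of_salary

# rounding down so that we do not go over budget
ring_budget = math.floor(raw_budget)

print(f"Raw budget:  ${raw_budget:,.2f}")
print(f"Ring budget: ${ring_budget:,}")
```

If the per capita income changes, only the first variable needs to be updated.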

In [None]:
## Code 11.7.2 ##

diamonds.loc[ : , : ][diamonds.loc[ : , 'price'] <= 5795]

<hr style="height:.3px;border:none;color:#333;background-color:#333;"><br>
It looks like we have a lot of diamonds to choose from! Let's try to narrow this down to the five "best" diamonds given our budget. "Best" is subjective and depends on the analyst. As such, develop your own criteria, and then use the open code block below to subset your way to success! The sample solution for this exercise can be found within the next chapter (a starter solution is also available below if you get stuck). Remember, you can use a <em>len()</em> wrapper around your code to output your current diamond count, as in the code below:<br><br>

~~~
# current diamond count
print( f"Diamonds Remaining: {len(diamonds.loc[ : , : ][diamonds['price'] <= 5795])}" )
~~~

<br>

In [None]:
## Code 11.7.3 ##

## Open Coding Block ##






In [None]:
## Sample Solution 11.7.3 ##

# Note: The warnings you likely experienced in this exercise are normal.
#       I would show you how to suppress them but that is a very bad idea at
#       this stage in your Python journey.


# filters developed for each quantitative metric
diamonds.loc[ : , : ][diamonds.loc[ : , 'price'] <= 5795] \
                     [diamonds.loc[ : , 'carat'] >= 0.9]  \
                     [diamonds.loc[ : , 'color'] <= 6]    \
                     [diamonds.loc[ : , 'clarity'] <= 5]  \
                     [diamonds.loc[ : , 'cut'] <= 1]
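
<br>
The chained brackets in the sample solution apply one filter at a time. A commonly used alternative, which is not the approach taken in this chapter, is to store each condition separately and combine them with the <strong>&amp;</strong> operator so that all filtering happens in a single step. The sketch below assumes a made-up dataset of four rings, and the names <em>budget_ok</em>, <em>size_ok</em>, and <em>color_ok</em> are hypothetical. In my experience, this style tends to avoid the warnings mentioned in the note above.

```python
import pandas as pd

# hypothetical dataset
rings = pd.DataFrame(data = {'carat' : [0.95, 1.10, 0.40, 1.20],
                             'color' : [3, 7, 2, 5],
                             'price' : [4800, 5200, 900, 6100]})

# storing each condition separately
budget_ok = rings.loc[ : , 'price'] <= 5795
size_ok   = rings.loc[ : , 'carat'] >= 0.9
color_ok  = rings.loc[ : , 'color'] <= 6

# combining conditions with & (a row must pass every condition)
shortlist = rings.loc[ budget_ok & size_ok & color_ok , : ]

print(shortlist)
```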

<hr style="height:.3px;border:none;color:#333;background-color:#333;">
<hr style="height:.3px;border:none;color:#333;background-color:#333;">
<br>

~~~
  _    _                             _ 
 | |  | |                           | |
 | |__| | ___   ___  _ __ __ _ _   _| |
 |  __  |/ _ \ / _ \| '__/ _` | | | | |
 | |  | | (_) | (_) | | | (_| | |_| |_|
 |_|  |_|\___/ \___/|_|  \__,_|\__, (_)
                                __/ |  
                               |___/   
~~~

<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<hr style="height:.9px;border:none;color:#333;background-color:#333;" />
<br>
<h4><u>Diamonds Dataset Source</u></h4>
Copyright Date: 2007<br>Author: Brian Pope<br>Publisher: Research Publishers LLC.<br>Publisher Location: Manhattan Beach, California<br>Permission for use given by Dr. Thomas Miller, Northwestern University

<br>