# Notes (IQR)

All we are going to do in this file is compute the inter-quartile-range and create a variable that references out data without these outliers. We're also going to create a third variable that excludes companies along with the IQR.

For each, we will also compute a measure of skew to see how the data are shaped.

As a note, the IQR is:
$\text{IQR} = Q3 - Q1$

Outliers are considered:
$Q1 - \text{IQR} \cdot 1.5$ and $Q3 + \text{IQR} \cdot 1.5$ 

In [3]:
# Importing libraries and variables

import pandas as pd
yfin_csv = pd.read_csv(r'https://raw.githubusercontent.com/alexcrockett/Jupyter-Playground/personal/yfin_dataset/02-data/stock_details_5_years.csv')
from yfin_group_range_py import company_averages, first_quartile, fourth_quartile, companies_in_first_quartile, companies_in_fourth_quartile

In [6]:
# Defining the IQR

## Define the 3rd Quartile and median
median = company_averages.quantile(0.5)  # The median (50th percentile)
third_quartile = fourth_quartile  # The third quartile (75th percentile)

## Define companies in the third quartile
companies_in_third_quartile = company_averages[(company_averages > median) & (company_averages <= third_quartile)]

# Define the IQR
IQR = third_quartile - first_quartile

print("IQR= ", float(str(IQR)))

IQR=  112.21191762678677


In [11]:
# Finding outliers
outlier_scope = 1.5 * IQR 

## Calculate the lower bound for outliers
outlier_lower_bound = first_quartile - outlier_scope

## Identify outliers below the lower bound
outliers_lower = company_averages[company_averages < outlier_lower_bound]

# -----

## Calculate the upper bound for outliers
outlier_upper_bound = third_quartile + outlier_scope

## Identify outliers below the lower bound
outliers_upper = company_averages[company_averages > outlier_upper_bound]

In [12]:
# Review the results
print("Median: ", median)
print("--------------------")

print("IQR: ", float(str(IQR)))
print("--------------------")

print("Outliers Scope: ", outlier_scope)
print("--------------------")

print(" ") # Add a little space for reading

print("--------------------")
print("Outliers in the lower bound:")
print(outliers_lower)
print("-------------------")

print("Outliers in the upper bound:")
print(outliers_upper)
print("-------------------")


Median:  80.71573753680524
--------------------
IQR:  112.21191762678677
--------------------
Outliers Scope:  168.31787644018016
--------------------
 
Outliers in the lower bound:
Series([], Name: Average, dtype: float64)
-------------------
Outliers in the upper bound:
Company
ADBE      417.060667
ASML      477.955218
AVGO      445.429283
AZO      1644.378406
BKNG     2126.701202
BLK       604.947085
CHTR      495.415442
CMG      1324.127961
COST      394.662326
CTAS      343.310389
ELV       363.485461
EQIX      638.804183
FCNCA     702.096642
FICO      473.674489
GWW       434.757946
HUBS      357.071477
HUM       401.488365
IDXX      410.794145
INTU      381.231062
LMT       369.306409
LRCX      423.009993
MELI     1032.599313
MPWR      329.397594
MSCI      400.496638
MTD      1106.778430
NFLX      398.320630
NOC       365.952027
NOW       439.452921
NVR      4369.584363
REGN      567.491359
SPGI      326.449222
TDG       552.684847
TMO       448.295794
ULTA      345.397342
UNH  

In [21]:
# Create a new dataframe without the outliers

## Map companies to outliers
outlier_companies = outliers_upper.index

## Remove the comapnies
yfin_restricted_set = yfin_csv[~yfin_csv['Company'].isin(outlier_companies)]

# Notes
`outliers_upper.index` gives you the names of the companies that are outliers.

`yfin_csv['Company'].isin(outlier_companies)` creates a boolean mask where each row is `True` if the company is in the list of outliers and `False` otherwise.

The `~` operator negates the boolean mask, so you select all rows where the company is not an outlier.

`yfin_csv[...]` with this mask removes the rows corresponding to outlier companies.

In [22]:
# Revaluate the mean and median scores

## Means
mean_open_restricted = yfin_restricted_set['Open'].mean() # mean Open
mean_high_restricted = yfin_restricted_set['High'].mean() # mean High
mean_low_restricted = yfin_restricted_set['Low'].mean() # mean Low
mean_close_restricted = yfin_restricted_set['Close'].mean() # mean Close

## Medians
median_open_restricted = yfin_restricted_set['Open'].mean() # median Open
median_high_restricted = yfin_restricted_set['High'].mean() # median High
median_low_restricted = yfin_restricted_set['Low'].mean() # median Low
median_close_restricted = yfin_restricted_set['Close'].mean() # median Close

## Define a set of values to print
central_tendency_restricted = {
    "mean open": mean_open_restricted,
    "mean high": mean_high_restricted,
    "mean low": mean_low_restricted,
    "mean close": mean_close_restricted,
    "median open": median_open_restricted,
    "median high": median_high_restricted,
    "median low": median_low_restricted,
    "median close": median_close_restricted
}

# Print the results
for key, value in central_tendency_restricted.items():
    print(f"{key}: {value:.2f}")

mean open: 96.44
mean high: 97.64
mean low: 95.22
mean close: 96.45
median open: 96.44
median high: 97.64
median low: 95.22
median close: 96.45


# Notes

We can see that the gap between mean and median scores has radiacally reduced by removing ~35 companies from the set of 500. 

This gives us the data-sets
- yfin_csv
- yfin_restricted_set

We want one more set, a set without the outliers and without scores that have extreme ranges. We will also want to rename our datasets so they're easier to type. We will call them 
- set_1
- set_2
- set_3