### This script contains the following points:
#### 1. Importing libraries, defining project path, importing data set expeditions_clean.pkl
- Importing dataset "expeditions_clean.pkl" as "df_exp"

#### 2. Dropping outliers from complete dataset

#### 3. Creating subset for 1990 - 2020 data

#### 4. Exporting df_no_outs_3 to 'expeditions_clean_2.pkl'

#### 5. Exporting df_exp_recent to 'expeditions_30yrs_subset.pkl'

#### 6. Population Flow

## 1. Importing libraries, defining project path, importing data set expeditions_clean.pkl

In [1]:
# Importing pandas, numpy, os, plotly.express, and operator
import pandas as pd
import numpy as np
import os
import plotly.express as px
import operator

In [2]:
# Defining project folder path
path = r'C:\Users\prena\05-2023 Himalayan Expeditions Analysis'

In [3]:
# Importing expeditions_clean.pkl dataset
df_exp = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'expeditions_clean.pkl'))

In [4]:
pd.set_option('display.max_columns', None)

In [5]:
pd.set_option("display.max_rows", None)

In [6]:
df_exp.describe()

| | year | total_days | max_elev_reached | total_mbrs | mbrs_summited | mbrs_deaths | hired_abc | hired_summits | hired_deaths | is_no_hired_abc | is_o2_not_used | is_o2_climbing | is_o2_descent | is_o2_sleeping | is_o2_medical | is_o2_used | had_o2_unused | is_o2_unkwn_2 | o2_check |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7982.0 | 7982.0 | 7982.0 | 7982.0 | 7982.0 | 7982.0 | 7982.0 | 7982.0 | 7982.0 | 7982.0 | 7982.0 | 7982.0 | 7982.0 | 7982.0 | 7982.0 | 7982.0 | 7982.0 | 7982.0 | 7982.0 |
| mean | 2003.619394 | 26.552117 | 7451.066775 | 5.798797 | 1.873089 | 0.061639 | 2.770358 | 1.061889 | 0.027186 | 0.325232 | 0.681533 | 0.278752 | 0.015535 | 0.168379 | 0.034453 | 0.314959 | 0.100476 | 0.004385 | 1.598472 |
| std | 11.849882 | 15.57331 | 1009.599699 | 5.07657 | 2.843591 | 0.323586 | 4.697144 | 2.419158 | 0.237743 | 0.468491 | 0.465911 | 0.448413 | 0.123675 | 0.374226 | 0.1824 | 0.464529 | 0.300653 | 0.066077 | 0.797039 |
| min | 1921.0 | 1.0 | 3800.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 1998.0 | 14.0 | 6800.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 50% | 2006.0 | 25.0 | 7400.0 | 4.0 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 75% | 2012.0 | 37.0 | 8188.0 | 8.0 | 3.0 | 0.0 | 3.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 2.0 |
| max | 2020.0 | 280.0 | 8850.0 | 99.0 | 32.0 | 5.0 | 99.0 | 43.0 | 7.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 5.0 |


In [7]:
df_exp.shape

(7982, 27)

## 2. Dropping outliers from complete dataset

#### Dropping outliers max_elev_reached

Our previous script told us that there were 15 outliers for max_elev_reached. By dropping these outliers, our record count should go from 7982 to 7967.

In [8]:
df_exp[(df_exp.max_elev_reached >= 4718) & (df_exp.max_elev_reached <= 10270)].shape

(7967, 27)

In [9]:
df_no_outs_1 = df_exp[(df_exp.max_elev_reached >= 4718) & (df_exp.max_elev_reached <= 10270)]

In [10]:
df_no_outs_1.shape

(7967, 27)
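
The cut-offs used in this section (4718 and 10270 here, -7 and 17 for total_mbrs, -20.5 and 71.5 for total_days) coincide with the 1.5 × IQR fences that follow from the 25% and 75% values in the describe() output. The previous script is not shown here, so treating this as the rule it used is an assumption. The helper `iqr_fences` below is a hypothetical name written only for illustration.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# 25% and 75% values copied from the df_exp.describe() output
quartiles = {
    "max_elev_reached": (6800.0, 8188.0),
    "total_mbrs": (2.0, 8.0),
    "total_days": (14.0, 37.0),
}

# Assumption: outliers are values outside the 1.5 * IQR fences
fences = {col: iqr_fences(q1, q3) for col, (q1, q3) in quartiles.items()}

for col, (low, high) in fences.items():
    print(f"{col}: keep values between {low} and {high}")
```

Negative lower fences such as -7 and -20.5 can never be violated by a count or a duration, so for total_mbrs and total_days only the upper fence actually removes records.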

#### Dropping outliers total_mbrs

In [11]:
df_no_outs_1[(df_no_outs_1.total_mbrs >= -7) & (df_no_outs_1.total_mbrs <= 17)].shape

(7736, 27)

In [12]:
7967 - 7736

231

In [13]:
df_no_outs_2 = df_no_outs_1[(df_no_outs_1.total_mbrs >= -7) & (df_no_outs_1.total_mbrs <= 17)]

Our previous script told us that there were 231 outliers for the field 'total_mbrs'. Dropping them takes our record count from 7967 to 7736, a difference of exactly 231, which tells us that no record was an outlier for both max_elev_reached and total_mbrs.

In [14]:
df_no_outs_2.shape

(7736, 27)

#### Dropping outliers total_days

In [15]:
df_no_outs_2[(df_no_outs_2.total_days > -20.5) & (df_no_outs_2.total_days <= 71.5)].shape

(7701, 27)

In [16]:
7736 - 7701

35

Our previous script told us that there were 50 outliers for the field 'total_days', while there are only 35 records with outliers when using the dataframe that has already dropped outliers for max_elev_reached and total_mbrs.

This tells us that there were 15 records that had an outlier for total_days and an outlier for either max_elev_reached or total_mbrs. By dropping these remaining 35 records that have outliers for total_days, our record count should go from 7736 to 7701.

In [17]:
df_no_outs_3 = df_no_outs_2[(df_no_outs_2.total_days > -20.5) & (df_no_outs_2.total_days <= 71.5)]

In [18]:
df_no_outs_3.shape

(7701, 27)
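
The overlap reasoning in this section (50 total_days outliers in the full data, 35 left after the first two filters, so 15 shared with another field) can also be checked directly with boolean masks. The sketch below runs on a small made-up frame. `toy` and its values are invented for illustration and are not the expeditions data; only the fence values are taken from this script.

```python
import pandas as pd

# Made-up rows, for illustration only (not the expeditions data)
toy = pd.DataFrame({
    "max_elev_reached": [8000, 4000, 7500, 8200, 7000],
    "total_mbrs":       [5,    3,    40,   6,    25],
    "total_days":       [30,   90,   20,   100,  80],
})

# One boolean mask per field: True where the value lies outside the fences
out_elev = ~toy["max_elev_reached"].between(4718, 10270)
out_mbrs = ~toy["total_mbrs"].between(-7, 17)
out_days = ~toy["total_days"].between(-20.5, 71.5)

# All total_days outliers, and those that are also outliers in another field
days_total = int(out_days.sum())
days_shared = int((out_days & (out_elev | out_mbrs)).sum())
days_only = days_total - days_shared

print(days_total, days_shared, days_only)
```

Applied to df_exp with the same three masks, `days_total`, `days_only` and `days_shared` would be expected to reproduce the 50, 35 and 15 discussed above.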

## 3. Creating subset for 1990 - 2020 data

In [19]:
df_no_outs_3[df_no_outs_3['year']>=1990].shape

(6845, 27)

In [20]:
df_exp_recent = df_no_outs_3[df_no_outs_3['year']>=1990]

In [21]:
df_exp_recent.shape

(6845, 27)

## 4. Exporting df_no_outs_3 to 'expeditions_clean_2.pkl'

In [22]:
# Export data to pkl
df_no_outs_3.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'expeditions_clean_2.pkl'))

## 5. Exporting df_exp_recent to 'expeditions_30yrs_subset.pkl'

In [23]:
# Export data to pkl
df_exp_recent.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'expeditions_30yrs_subset.pkl'))
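
A quick way to confirm that an export like the ones above succeeded is to read the file back and compare it with the dataframe in memory. This is a minimal sketch on a made-up frame written to a temporary folder. `toy`, `recent` and `toy_subset.pkl` are hypothetical names, so the project files are not touched.

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for df_no_outs_3 (illustration only)
toy = pd.DataFrame({"year": [1989, 1995, 2020], "total_days": [30, 25, 40]})
recent = toy[toy["year"] >= 1990]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_subset.pkl")
    recent.to_pickle(out_path)           # export
    reloaded = pd.read_pickle(out_path)  # read back

# Raises AssertionError if values, dtypes, or index differ
pd.testing.assert_frame_equal(recent, reloaded)
print(reloaded.shape)
```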

## 6. Population Flow
<i>Note: the subset that filters out peaks is created in the script "Exploring Relationships".</i>
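
One way to summarise the population flow is to tabulate the record counts reported by the `.shape` outputs in the sections above. The sketch below only restates those counts. The step labels are informal descriptions, and the peak-filtering step from the other script is not included because its counts are not shown here.

```python
# (step label, rows remaining), taken from the .shape outputs in this script
flow = [
    ("expeditions_clean.pkl (df_exp)", 7982),
    ("max_elev_reached outliers dropped (df_no_outs_1)", 7967),
    ("total_mbrs outliers dropped (df_no_outs_2)", 7736),
    ("total_days outliers dropped (df_no_outs_3)", 7701),
    ("year >= 1990 (df_exp_recent)", 6845),
]

# Rows removed at each step, relative to the previous step
dropped = [prev - cur for (_, prev), (_, cur) in zip(flow, flow[1:])]

for (label, rows), lost in zip(flow, [0] + dropped):
    print(f"{rows:>5}  (-{lost:<3})  {label}")
```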