# Ski Resorts Exploratory Data Analysis

## Overview

Analyze datasets containing statistics on ski resorts around the world, identify the key factors that make a resort successful and profitable, and present the findings as visualizations and insights that would help a hypothetical resort business run effectively.

**Import Libraries and Datasets**

In [613]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import seaborn as sns

In [614]:
# use the ggplot style for all matplotlib plots
plt.style.use('ggplot')
# set the default figure size for all plots
plt.rcParams['figure.figsize'] = (12, 12)

In [615]:
# import datasets from the csv
resorts_df = pd.read_csv(r'./datasets/resorts.csv', encoding='cp1252')
snow_df = pd.read_csv(r'./datasets/snow.csv', encoding='cp1252')

**1. Previewing the head of each dataframe**

In [616]:
# change limits of rows and columns display

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

In [617]:
# resorts dataset
resorts_df.head()

Unnamed: 0,ID,Resort,Latitude,Longitude,Country,Continent,Price,Season,Highest point,Lowest point,Beginner slopes,Intermediate slopes,Difficult slopes,Total slopes,Longest run,Snow cannons,Surface lifts,Chair lifts,Gondola lifts,Total lifts,Lift capacity,Child friendly,Snowparks,Nightskiing,Summer skiing
0,1,Hemsedal,60.928244,8.383487,Norway,Europe,46,November - May,1450,620,29,10,4,43,6,325,15,6,0,21,22921,Yes,Yes,Yes,No
1,2,Geilosiden Geilo,60.534526,8.206372,Norway,Europe,44,November - April,1178,800,18,12,4,34,2,100,18,6,0,24,14225,Yes,Yes,Yes,No
2,3,Golm,47.05781,9.828167,Austria,Europe,48,December - April,2110,650,13,12,1,26,9,123,4,4,3,11,16240,Yes,No,No,No
3,4,Red Mountain Resort-Rossland,49.10552,-117.84628,Canada,North America,60,December - April,2075,1185,20,50,50,120,7,0,2,5,1,8,9200,Yes,Yes,Yes,No
4,5,Hafjell,61.230369,10.529014,Norway,Europe,45,November - April,1030,195,33,7,4,44,6,150,14,3,1,18,21060,Yes,Yes,Yes,No


In [618]:
snow_df.head()

Unnamed: 0,Month,Latitude,Longitude,Snow
0,2022-12-01,63.125,68.875,95.28
1,2022-12-01,63.125,69.125,100.0
2,2022-12-01,63.125,69.375,100.0
3,2022-12-01,63.125,69.625,100.0
4,2022-12-01,63.125,69.875,100.0


**2. Data Preparation**

In [619]:
# check total rows and columns in the dataset
resorts_df.shape

(499, 25)

In [620]:
# check total rows and columns in the dataset
snow_df.shape

(820522, 4)

**2.1. Datatype Verification**

In [621]:
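# datatype verification
resorts_df.dtypes  # not displayed: a notebook cell only echoes its last expression

# inspect the distinct values of a candidate boolean column
resorts_df['Child friendly'].unique()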


array(['Yes', 'No'], dtype=object)

In [622]:
resorts_df['Snowparks'].unique()

array(['Yes', 'No'], dtype=object)

In [623]:
resorts_df['Nightskiing'].unique()

array(['Yes', 'No'], dtype=object)

In [624]:
resorts_df['Summer skiing'].unique()

array(['No', 'Yes'], dtype=object)

In [625]:
# These four columns contain only 'Yes' and 'No', so they can be converted to boolean type
bool_columns = ['Child friendly', 'Snowparks', 'Nightskiing', 'Summer skiing']
for col in bool_columns:
    resorts_df[col] = resorts_df[col].map({'Yes': True, 'No': False})

# enforce a plain bool dtype on the mapped columns
resorts_df[bool_columns] = resorts_df[bool_columns].astype('bool')

resorts_df[bool_columns].head()


Unnamed: 0,Child friendly,Snowparks,Nightskiing,Summer skiing
0,True,True,True,False
1,True,True,True,False
2,True,False,False,False
3,True,True,True,False
4,True,True,True,False
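
One caveat with the `map` plus `astype('bool')` step: `map` turns any value other than 'Yes' or 'No' into NaN, and `astype('bool')` then converts NaN to True without warning. The sketch below shows a stricter conversion that raises instead. The helper name `yes_no_to_bool` and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def yes_no_to_bool(df, columns):
    """Return a copy of df with 'Yes'/'No' columns converted to bool.

    Hypothetical helper: raises on unexpected values instead of coercing them.
    """
    out = df.copy()
    mapping = {'Yes': True, 'No': False}
    for col in columns:
        mapped = out[col].map(mapping)
        if mapped.isna().any():
            # collect the offending raw values for the error message
            bad = sorted(out.loc[mapped.isna(), col].astype(str).unique())
            raise ValueError(f"unexpected values in {col!r}: {bad}")
        out[col] = mapped.astype(bool)
    return out

# toy data standing in for resorts_df
demo = pd.DataFrame({'Snowparks': ['Yes', 'No', 'Yes'],
                     'Nightskiing': ['No', 'No', 'Yes']})
converted = yes_no_to_bool(demo, ['Snowparks', 'Nightskiing'])
```

Because the four columns were already checked with `unique()`, the original approach is safe for this dataset; the stricter version mainly helps if the CSV is refreshed later.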


In [626]:
# datatype verification
snow_df.dtypes

Month         object
Latitude     float64
Longitude    float64
Snow         float64
dtype: object
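
`Month` is loaded as `object`, meaning plain strings. A minimal sketch of converting it to a datetime type follows, assuming the `YYYY-MM-DD` format seen in the preview holds for every row. The toy frame stands in for `snow_df`, and the `month_number` column is a hypothetical addition.

```python
import pandas as pd

# toy rows copied from the snow_df preview
demo = pd.DataFrame({
    'Month': ['2022-12-01', '2022-12-01'],
    'Latitude': [63.125, 63.125],
    'Longitude': [68.875, 69.125],
    'Snow': [95.28, 100.0],
})

# with the default errors='raise', any row that breaks the assumed format fails loudly
demo['Month'] = pd.to_datetime(demo['Month'], format='%Y-%m-%d')

# the .dt accessor is then available for month-based analysis
demo['month_number'] = demo['Month'].dt.month
```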

**2.2. Handle Duplicates and Nulls**

In [627]:
# check null values
resorts_df.isna().sum()

ID                     0
Resort                 0
Latitude               0
Longitude              0
Country                0
Continent              0
Price                  0
Season                 0
Highest point          0
Lowest point           0
Beginner slopes        0
Intermediate slopes    0
Difficult slopes       0
Total slopes           0
Longest run            0
Snow cannons           0
Surface lifts          0
Chair lifts            0
Gondola lifts          0
Total lifts            0
Lift capacity          0
Child friendly         0
Snowparks              0
Nightskiing            0
Summer skiing          0
dtype: int64

In [628]:
# check null values
snow_df.isna().sum()

Month        0
Latitude     0
Longitude    0
Snow         0
dtype: int64

In [629]:
# check duplicate values
resorts_df[resorts_df.duplicated()]

Unnamed: 0,ID,Resort,Latitude,Longitude,Country,Continent,Price,Season,Highest point,Lowest point,Beginner slopes,Intermediate slopes,Difficult slopes,Total slopes,Longest run,Snow cannons,Surface lifts,Chair lifts,Gondola lifts,Total lifts,Lift capacity,Child friendly,Snowparks,Nightskiing,Summer skiing


In [630]:
# check duplicate values
snow_df[snow_df.duplicated()]

# no nulls or duplicate rows found in either dataset; move on to the next step

Unnamed: 0,Month,Latitude,Longitude,Snow
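
The null and duplicate checks above can also be condensed into a single summary per dataframe. This is a sketch only: `quality_report` is a hypothetical helper name, and the toy frame uses made-up rows to show a non-zero result.

```python
import pandas as pd

def quality_report(df):
    """Count rows, missing cells and fully duplicated rows (hypothetical helper)."""
    return {
        'rows': len(df),
        'null_cells': int(df.isna().sum().sum()),
        'duplicate_rows': int(df.duplicated().sum()),
    }

# toy frame with one missing cell and one repeated row
demo = pd.DataFrame({'Resort': ['Golm', 'Golm', None], 'Price': [48, 48, 45]})
report = quality_report(demo)
```

Calling `quality_report(resorts_df)` and `quality_report(snow_df)` would then give the same information as the four cells above in two lines.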


**2.3. Data Transform**

In [631]:
import sys
import os
import importlib

# make the local helper module importable
sys.path.append(os.path.abspath('./utils'))

import utils

# reload so that edits to the utils module are picked up without restarting the kernel
importlib.reload(utils)

# characters other than word characters, whitespace and hyphens are treated as invalid
# (\w already covers a-z, A-Z, 0-9 and underscore)
pattern = r'[^\w\s-]'

# count the resort names that contain at least one invalid character
invalid_mask = resorts_df['Resort'].str.contains(pattern, regex=True)

invalid_count = int(invalid_mask.sum())
invalid_count

211

In [632]:
# hyphens are allowed in the column data
resorts_df = utils.remove_invalid_characters(resorts_df, columns=['Resort'])
resorts_df.head()

Unnamed: 0,ID,Resort,Latitude,Longitude,Country,Continent,Price,Season,Highest point,Lowest point,Beginner slopes,Intermediate slopes,Difficult slopes,Total slopes,Longest run,Snow cannons,Surface lifts,Chair lifts,Gondola lifts,Total lifts,Lift capacity,Child friendly,Snowparks,Nightskiing,Summer skiing
0,1,Hemsedal,60.928244,8.383487,Norway,Europe,46,November - May,1450,620,29,10,4,43,6,325,15,6,0,21,22921,True,True,True,False
1,2,Geilosiden Geilo,60.534526,8.206372,Norway,Europe,44,November - April,1178,800,18,12,4,34,2,100,18,6,0,24,14225,True,True,True,False
2,3,Golm,47.05781,9.828167,Austria,Europe,48,December - April,2110,650,13,12,1,26,9,123,4,4,3,11,16240,True,False,False,False
3,4,Red Mountain Resort-Rossland,49.10552,-117.84628,Canada,North America,60,December - April,2075,1185,20,50,50,120,7,0,2,5,1,8,9200,True,True,True,False
4,5,Hafjell,61.230369,10.529014,Norway,Europe,45,November - April,1030,195,33,7,4,44,6,150,14,3,1,18,21060,True,True,True,False
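
The body of `utils.remove_invalid_characters` is not shown in this notebook. The sketch below is one way such a helper might work, assuming it strips every character that is not a word character, whitespace or a hyphen. The function name `remove_invalid_characters_sketch` and the toy resort names are made up for illustration and are not the author's implementation.

```python
import pandas as pd

# anything that is not a word character, whitespace or a hyphen
INVALID_CHARS = r'[^\w\s-]'

def remove_invalid_characters_sketch(df, columns):
    """Hypothetical stand-in for utils.remove_invalid_characters."""
    out = df.copy()
    for col in columns:
        cleaned = out[col].str.replace(INVALID_CHARS, '', regex=True)
        # collapse runs of whitespace left behind and trim the ends
        out[col] = cleaned.str.replace(r'\s+', ' ', regex=True).str.strip()
    return out

# made-up names, not rows from resorts.csv
demo = pd.DataFrame({'Resort': ['Snow?Peak (North)', 'Red Mountain Resort-Rossland']})
cleaned = remove_invalid_characters_sketch(demo, ['Resort'])
```

Working on a copy keeps the input frame unchanged, which matches how the notebook reassigns the result to `resorts_df`.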


In [633]:
# check for invalid characters in the column
resorts_df['Country'].unique()

array(['Norway', 'Austria', 'Canada', 'New Zealand', 'Chile', 'Germany',
       'Switzerland', 'Italy', 'France', 'United Kingdom',
       'United States', 'Andorra', 'Australia', 'Argentina', 'Finland',
       'Japan', 'Slovenia', 'Bulgaria', 'Spain', 'Sweden', 'Lebanon',
       'Russia', 'Ukraine', 'Georgia', 'Serbia', 'Turkey', 'Slovakia',
       'Poland', 'Bosnia and Herzegovina', 'Czech Republic', 'Iran',
       'South Korea', 'Romania', 'Greece', 'Liechtenstein', 'Lithuania',
       'Kazakhstan', 'China'], dtype=object)

In [634]:
# check for invalid characters in the column
resorts_df['Continent'].unique()

array(['Europe', 'North America', 'Oceania', 'South America', 'Asia'],
      dtype=object)

In [635]:
# The original dataset stores some information in a human-readable form. One example is the Season column, which describes a period as "<start month> - <end month>" using English month names.

# Splitting this into two separate columns, such as "Season From" and "Season To", would make periodic analysis easier.

# check for invalid characters in the column
resorts_df['Season'].unique()

array(['November - May', 'November - April', 'December - April',
       'June - September', 'June - October', 'Year-round',
       'October - June', 'September - June', 'December - March',
       'October - May',
       'December - April, June - August, October - November',
       'July - September', 'November - May, June - August',
       'May - September', 'December - May', 'July', 'September - May',
       'October - April', 'April', 'Unknown', 'July - April',
       'May - October', 'November - June', 'September - April', 'May',
       'June - May', 'November - March', 'March', 'December',
       'October - November, December - May, June - October',
       'July - October'], dtype=object)

In [636]:
# unknown season found - for now, we can just ignore this
resorts_df.query('Season == "Unknown"').head()

Unnamed: 0,ID,Resort,Latitude,Longitude,Country,Continent,Price,Season,Highest point,Lowest point,Beginner slopes,Intermediate slopes,Difficult slopes,Total slopes,Longest run,Snow cannons,Surface lifts,Chair lifts,Gondola lifts,Total lifts,Lift capacity,Child friendly,Snowparks,Nightskiing,Summer skiing
123,124,Courmayeur Checrouit - Val Veny,45.787425,6.973062,Italy,Europe,46,Unknown,2755,1205,16,21,4,41,0,280,4,8,6,18,24497,True,True,False,False
181,182,Mondole Ski-Artesina-Frabosa Soprana-Prato Nevoso,44.249446,7.775081,Italy,Europe,33,Unknown,807,803,42,51,11,104,0,0,19,14,0,33,26068,True,True,True,False
233,234,Mzaar Kfardebian,33.972129,35.839567,Lebanon,Asia,51,Unknown,2465,1850,46,30,4,80,0,0,0,0,0,0,0,True,False,False,False
241,242,Jay Peak,47.631371,-120.829534,United States,North America,70,Unknown,1175,563,15,31,30,76,0,0,2,6,1,9,11675,True,False,False,False
299,300,Oppdal,62.535178,9.623304,Norway,Europe,44,Unknown,1300,585,40,7,9,56,0,0,1,0,0,1,0,True,False,False,False


In [637]:
# a season may contain more than one period; use October as a search pattern to find examples
resorts_df.loc[resorts_df['Season'].str.contains('October')].head()

Unnamed: 0,ID,Resort,Latitude,Longitude,Country,Continent,Price,Season,Highest point,Lowest point,Beginner slopes,Intermediate slopes,Difficult slopes,Total slopes,Longest run,Snow cannons,Surface lifts,Chair lifts,Gondola lifts,Total lifts,Lift capacity,Child friendly,Snowparks,Nightskiing,Summer skiing
7,8,Nevados de Chillan,-36.613844,-72.071805,Chile,South America,57,June - October,2700,1600,10,15,10,35,13,0,4,6,1,11,11080,True,True,False,True
15,16,Treble Cone,-44.632375,168.872825,New Zealand,Oceania,68,June - October,1960,1260,4,9,9,22,0,32,2,2,0,4,4520,False,True,False,True
28,29,Arapahoe Basin,40.121121,-80.669843,United States,North America,83,October - June,3790,3286,40,25,40,105,4,0,2,5,0,7,7200,True,True,False,False
29,30,The Remarkables,-45.05496,168.815859,New Zealand,Oceania,74,June - October,1943,1586,3,4,3,10,1,0,3,4,0,7,8400,True,True,False,True
44,45,Glacier 3000 Les Diablerets,46.351102,7.156644,Switzerland,Europe,54,October - May,3016,1343,9,3,12,24,8,0,4,3,3,10,10260,True,True,False,False


In [638]:

# split Season into its individual months
# the separators in the column are ' - ' (between months) and ',' (between periods)
import re

tmp_df = resorts_df.copy()

# str.split() treats its argument as a literal separator, so a regex split is needed;
# requiring spaces around '-' keeps 'Year-round' intact
tmp_df['split_len'] = tmp_df['Season'].apply(lambda x: re.split(r'\s+-\s+|,\s*', x))

tmp_df


Unnamed: 0,ID,Resort,Latitude,Longitude,Country,Continent,Price,Season,Highest point,Lowest point,Beginner slopes,Intermediate slopes,Difficult slopes,Total slopes,Longest run,Snow cannons,Surface lifts,Chair lifts,Gondola lifts,Total lifts,Lift capacity,Child friendly,Snowparks,Nightskiing,Summer skiing,split_len
0,1,Hemsedal,60.928244,8.383487,Norway,Europe,46,November - May,1450,620,29,10,4,43,6,325,15,6,0,21,22921,True,True,True,False,"[November, May]"
1,2,Geilosiden Geilo,60.534526,8.206372,Norway,Europe,44,November - April,1178,800,18,12,4,34,2,100,18,6,0,24,14225,True,True,True,False,"[November, April]"
2,3,Golm,47.057810,9.828167,Austria,Europe,48,December - April,2110,650,13,12,1,26,9,123,4,4,3,11,16240,True,False,False,False,"[December, April]"
3,4,Red Mountain Resort-Rossland,49.105520,-117.846280,Canada,North America,60,December - April,2075,1185,20,50,50,120,7,0,2,5,1,8,9200,True,True,True,False,"[December, April]"
4,5,Hafjell,61.230369,10.529014,Norway,Europe,45,November - April,1030,195,33,7,4,44,6,150,14,3,1,18,21060,True,True,True,False,"[November, April]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
494,495,Puigmal,42.395007,2.108883,France,Europe,0,Unknown,2700,1830,9,15,7,31,0,0,11,2,0,13,11865,True,False,False,False,[Unknown]
495,496,Kranzberg-Mittenwald,47.451359,11.228630,Germany,Europe,29,December,1350,980,6,7,2,15,2,8,9,1,0,10,5850,True,True,True,False,[December]
496,497,Wetterstein lifts-Wettersteinbahnen- Ehrwald,47.406897,10.927998,Austria,Europe,43,December - March,1530,1000,15,5,3,23,3,33,6,4,0,10,5425,True,True,False,False,"[December, March]"
497,498,Stuhleck-Spital am Semmering,47.574195,15.789964,Austria,Europe,42,April,1774,777,18,6,0,24,0,240,7,2,0,9,14400,True,True,True,False,[April]
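
To get from the raw `Season` strings to the "Season From" and "Season To" idea described earlier, each string can be parsed into a list of (start month, end month) pairs. The sketch below rests on a few assumptions that the notebook does not state: a single month such as 'July' starts and ends in the same month, 'Year-round' means January to December, and 'Unknown' yields no periods. `parse_season` is a hypothetical name.

```python
MONTH_NAMES = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
               'August', 'September', 'October', 'November', 'December']
# {'January': 1, ..., 'December': 12}
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

def parse_season(season):
    """Parse a Season string into a list of (from_month, to_month) tuples."""
    if season == 'Unknown':
        return []
    if season == 'Year-round':
        return [(1, 12)]
    periods = []
    for chunk in season.split(','):            # one chunk per period
        parts = [p.strip() for p in chunk.split(' - ')]
        start, end = parts[0], parts[-1]       # a single month is its own end
        periods.append((MONTHS[start], MONTHS[end]))
    return periods
```

Applied with `resorts_df['Season'].apply(parse_season)`, the first tuple of each list could feed "Season From" and "Season To", while resorts with several periods keep all of them in the list.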


In [639]:
# round latitude and longitude to three decimal places, matching the precision of the snow_df coordinates
resorts_df[['Latitude', 'Longitude']] = resorts_df[['Latitude', 'Longitude']].round(3)
resorts_df.head()

Unnamed: 0,ID,Resort,Latitude,Longitude,Country,Continent,Price,Season,Highest point,Lowest point,Beginner slopes,Intermediate slopes,Difficult slopes,Total slopes,Longest run,Snow cannons,Surface lifts,Chair lifts,Gondola lifts,Total lifts,Lift capacity,Child friendly,Snowparks,Nightskiing,Summer skiing
0,1,Hemsedal,60.928,8.383,Norway,Europe,46,November - May,1450,620,29,10,4,43,6,325,15,6,0,21,22921,True,True,True,False
1,2,Geilosiden Geilo,60.535,8.206,Norway,Europe,44,November - April,1178,800,18,12,4,34,2,100,18,6,0,24,14225,True,True,True,False
2,3,Golm,47.058,9.828,Austria,Europe,48,December - April,2110,650,13,12,1,26,9,123,4,4,3,11,16240,True,False,False,False
3,4,Red Mountain Resort-Rossland,49.106,-117.846,Canada,North America,60,December - April,2075,1185,20,50,50,120,7,0,2,5,1,8,9200,True,True,True,False
4,5,Hafjell,61.23,10.529,Norway,Europe,45,November - April,1030,195,33,7,4,44,6,150,14,3,1,18,21060,True,True,True,False
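
Rounding to three decimals matches the number of decimal places in `snow_df`, but the preview suggests that the `snow_df` coordinates sit on a regular grid (63.125, 68.875, 69.125, ...), so rounded resort coordinates will rarely equal a grid value exactly. The sketch below snaps a coordinate to the nearest grid centre. The 0.25 step and 0.125 offset are assumptions inferred from the five preview rows only, and `snap_to_grid` is a hypothetical helper.

```python
import math

def snap_to_grid(value, step=0.25, offset=0.125):
    """Snap a coordinate to the nearest grid-cell centre.

    step and offset are assumptions inferred from the snow_df preview,
    not documented properties of the dataset.
    """
    # index of the nearest cell, counted from the offset
    cell = math.floor((value - offset) / step + 0.5)
    return round(cell * step + offset, 3)

# Hemsedal's coordinates from the resorts preview
hemsedal = (snap_to_grid(60.928), snap_to_grid(8.383))
```

With pandas, `resorts_df['Latitude'].apply(snap_to_grid)` and the same call on `Longitude` would give columns that can be merged against `snow_df`, provided the grid assumption is first confirmed on the full dataset.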
