# Exploratory Data Analysis of Spanish Railway Ticket Pricing

Data Source : [Spanish Rail Tickets Pricing](https://www.kaggle.com/datasets/thegurusteam/spanish-high-speed-rail-system-ticket-pricing)
<br> <br>
![Imgur](https://imgur.com/aQxs8Ho.jpg)
<br>


## EDA (Exploratory Data Analysis) 

>### What is EDA?
**EDA (Exploratory Data Analysis)** is the process of taking raw data and organising it into a form that is easy to read, understand and work with.

>'Easy to work with' means that the organised data can be used to extract useful insights and answer questions that would be hard to answer without having the data stored in a simple, sorted form, typically an `Excel` or `CSV` file.

>Exploratory Data Analysis, or EDA, is an important step in any Data Analysis or Data Science project. EDA is the process of investigating the dataset to discover patterns, and anomalies (outliers), and form hypotheses based on our understanding of the dataset.

>  EDA is basically used to see what data can reveal beyond the formal modelling or hypothesis testing task, and it provides a better understanding of the dataset's variables and the relationships between them. Originally developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be widely used in the data discovery process today.

>EDA can help us deliver great business results by improving our existing knowledge, and can also surface new insights that we might not be aware of.



**Tools Used**

* `opendatasets` (Jovian library to download a Kaggle dataset) 
<br>

* Data cleaning:
  
  1.`Pandas`
  
  2.`Numpy`
<br>

* Data Visualization
  
  1.`Matplotlib` 
  
  2.`Seaborn`
  
  3.`plotly`
  
  4.`folium`

## About the Project
In this project, we analyse Spanish rail ticket data. The selected dataset covers trips made by different trains, with different durations and prices.
Personally I find `Pricing` quite interesting because it lets us analyse how prices change over time and with many other factors.

## Steps followed

### Step 1: Selecting a real world dataset:
* We will download our dataset from `Kaggle` using the `opendatasets` library created by `Jovian`, which imports datasets directly from the Kaggle website.

      import opendatasets as od

      dataset = 'https://www.kaggle.com/datasets/thegurusteam/spanish-high-speed-rail-system-ticket-pricing'
      od.download(dataset)

### Step 2: Performing data preparation & cleaning
* We will load the dataset into a dataframe  using Pandas, explore the different columns and range of values, handle missing values and incorrect datatypes and basically make our data ready to use for our analysis.

### Step 3: Performing exploratory analysis and visualization and asking interesting questions
 * We will compute the mean, sum, range and other interesting statistics for numeric columns, explore their distributions using histograms etc., note interesting insights from the exploratory analysis, ask interesting questions about the dataset and look for their answers by visualizing our data.

## **Happy Coding!!**
Use the "Run" button to execute the code.

### **Install packages and import libraries**

Uncomment the below cell, to install all the required packages

In [1]:
# !pip install jovian --upgrade --quiet
# !pip install scipy pandas seaborn numpy matplotlib wordcloud  --upgrade --quiet
# !pip install opendatasets plotly folium --upgrade --quiet

In [2]:
# Import library to download data from Kaggle
import opendatasets as od

# Import python data analysis libraries
import pandas as pd
import numpy as np
from scipy.stats.mstats import winsorize      #---- Handles outliers

# Import visualisation libraries
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import plotly.express as px
import folium
import folium.plugins as plugins
from wordcloud import WordCloud

# Import other required libraries
import jovian
import os


Let us set a base style for all our visualisations in this notebook. These customisations from [Mounir](https://jovian.ai/kara-mounir) produce clear, high-quality images suitable for presentations. You can override them and customise individual plots as required.

In [3]:
# Set plot parameters for the notebook
%matplotlib inline
sns.set_style('white')
matplotlib.rcParams['font.size'] = 18
matplotlib.rcParams['figure.figsize'] = (18, 10)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
matplotlib.rcParams['xtick.major.pad']='10'
matplotlib.rcParams['ytick.major.pad']='10'

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Execute this to save new versions of the notebook
jovian.commit(project="zerotoanalyst-rail-ticket-pricing-eda")


### **Downloading the Dataset**


#### **Working with large datasets**

**To download a dataset from Kaggle we have to provide an `API Key`, which can be done in either of two ways:**
>1. By entering the `API Key` manually at the runtime.
>2. By providing the `API Key` by a file `kaggle.json`, which automatically does the authentication for us.
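For the second method, the token file downloaded from a Kaggle account holds a `username` and a `key` field. The sketch below only checks that such a file is well-formed before a download is attempted. The helper name `load_kaggle_credentials` is hypothetical, the demo token is a placeholder, and it is an assumption here that `opendatasets` picks up `kaggle.json` from the notebook's working directory.

```python
import json
import os
import tempfile
from pathlib import Path


def load_kaggle_credentials(path="kaggle.json"):
    """Hypothetical helper: read a Kaggle token file and check its fields."""
    creds = json.loads(Path(path).read_text())
    missing = {"username", "key"} - creds.keys()
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")
    return creds


# Demo with a placeholder token written to a temporary folder (not real credentials)
demo_path = os.path.join(tempfile.mkdtemp(), "kaggle.json")
Path(demo_path).write_text(
    json.dumps({"username": "your-username", "key": "your-api-key"})
)
demo_creds = load_kaggle_credentials(demo_path)
print(sorted(demo_creds))
```

Keeping the token in a file avoids typing the key at every run, but the file should stay out of version control.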

In [None]:
dataset_url = 'https://www.kaggle.com/datasets/thegurusteam/spanish-high-speed-rail-system-ticket-pricing?select=thegurus-opendata-renfe-trips.csv'
od.download(dataset_url)

Let us check if the data has been downloaded.

In [None]:
data_dir = './spanish-high-speed-rail-system-ticket-pricing'
os.listdir(data_dir)

The data has been downloaded and unzipped to the folder `./spanish-high-speed-rail-system-ticket-pricing`. Let us now check the size of the folder. There is one file in this folder:  
`thegurus-opendata-renfe-trips.csv` 7.24 GB
> We will use this file for our analysis

>There are 38.75 million records in thegurus-opendata-renfe-trips csv

In [None]:
%%time
#load the file using Pandas. 
rail_ticket_price_df = pd.read_csv(data_dir+'/thegurus-opendata-renfe-trips.csv')

Phenomenal!! That's 7.24 GB data over 38.75 million records loaded into a Pandas dataframe in under  8 mins!!

In [None]:
rail_ticket_price_df.info()

There are 38.75 million rows and 13 columns in the dataset. Loaded into Pandas, the dataset occupies about 4 GB of memory, compared with 7.24 GB on disk as a `.csv`.

#### **Work with a sample - a fraction of the dataset**

As this is a massive dataset, let's use 1% of the data to build our EDA framework. This is very important and often overlooked. Working on a smaller sample saves significant time while experimenting with code.

Once we have the full framework ready, we can run the analysis on the complete dataset.
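If loading the full file first is too slow or does not fit in memory, one possible alternative is to sample while reading in chunks. This is a minimal sketch, not what the notebook does: the helper `sample_csv_in_chunks` is hypothetical, and a tiny in-memory CSV stands in for the real file.

```python
import io

import pandas as pd


def sample_csv_in_chunks(source, frac=0.01, chunksize=100_000, seed=42):
    """Hypothetical helper: sample `frac` of every chunk and concatenate."""
    reader = pd.read_csv(source, chunksize=chunksize)
    parts = [
        # A different seed per chunk avoids picking the same positions each time
        chunk.sample(frac=frac, random_state=seed + i)
        for i, chunk in enumerate(reader)
    ]
    return pd.concat(parts, ignore_index=True)


# Tiny in-memory CSV (1000 rows) standing in for the 7.24 GB file
csv_text = "price\n" + "\n".join(str(i) for i in range(1000))
sample_df = sample_csv_in_chunks(io.StringIO(csv_text), frac=0.1, chunksize=200)
print(len(sample_df))
```

Passing a fixed `random_state` also makes the sample reproducible between runs, which helps when comparing results while the framework is still changing.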

In [None]:
%%time
rail_ticket_price_sample_df = rail_ticket_price_df.sample(frac=0.01)
rail_ticket_price_sample_df.info()

Our sample dataset now contains 387k rows and the same 13 columns. It is 44.3 MB in size.

#### **Save intermediate results**

While working with large datasets, you may run into runtime issues. It therefore helps to save intermediate results to your Google Drive or a local folder, so that you can pick up and continue from that point.  

Let us save this sample dataset on our local machine. We will use the data in this file to build our framework.

We will continue to save short snapshots of data offline as we move along the notebook. You can save the data in various formats including binary formats such as `.feather`

In [None]:
# Write the DataFrame to CSV file.
rail_ticket_price_sample_df.to_csv('rail_ticket_price_sample.csv')

#### **Explore faster loading and lesser memory**

We will use some techniques to load data to Pandas faster and use less memory.

- **drop columns:** select a subset of columns relevant for analysis
- **identify categorical columns:** change the dtype to `category`
- **parse_dates:** change columns with date/time values to type `DateTime`
- set **DateTime** column as the **index**
- **use smaller dtypes**: we don't see any need as of now
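The effect of the `category` and smaller-dtype techniques listed above can be checked on a small synthetic frame. This is only an illustration: the column values are made up, and the savings on the real dataset will differ.

```python
import numpy as np
import pandas as pd

# Synthetic data: a repetitive text column and a float64 numeric column
rng = np.random.default_rng(0)
n = 10_000
demo_df = pd.DataFrame({
    "vehicle_type": rng.choice(["AVE", "ALVIA", "REGIONAL"], size=n),
    "price": rng.uniform(10, 100, size=n),
})
before = demo_df.memory_usage(deep=True).sum()

# Repetitive strings become integer codes; float64 becomes float32
compact_df = demo_df.astype({"vehicle_type": "category", "price": "float32"})
after = compact_df.memory_usage(deep=True).sum()

print(f"{before:,} bytes -> {after:,} bytes")
```

`memory_usage(deep=True)` counts the actual string payloads, so it gives a fairer comparison than the default shallow estimate.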

In [None]:
# Use only a subset of the columns
selected_cols = ['origin', 'destination', 'departure', 'arrival', 'duration',
                 'vehicle_type', 'vehicle_class', 'price', 'fare']

selected_dtypes = {
    'duration': 'float32',
    'price': 'float32',   
}

**Note:** Remove comment in the cell below to run the analysis on the appropriate dataset.

In [None]:
#Sample data
# sample_csv_url = './rail_ticket_price_sample.csv'

#Full data
sample_csv_url = './spanish-high-speed-rail-system-ticket-pricing/thegurus-opendata-renfe-trips.csv'

Load the data to a pandas DataFrame

In [None]:
%%time

ticket_price_df = pd.read_csv(sample_csv_url, 
                            usecols=selected_cols, 
                            dtype=selected_dtypes, 
                            parse_dates=['departure', 'arrival'])

# TEST

In [None]:
ticket_price_df.info()

In [None]:
px.scatter(ticket_price_df,
          x = 'price',
          y = 'vehicle_type')

# End TEST

After selecting only the `required columns` and defining `datatypes` before reading,
the total load time dropped from `11min 18sec` to `4min 19sec`.

In [None]:
ticket_price_df.shape

Our dataset now has around `38.75 million` records, and we have selected `9` features out of `13` for our analysis.

In [None]:
ticket_price_df.info()

In [None]:
#Save Work
jovian.commit()

### **Data pre-processing**

Now that we have loaded the data into a Pandas dataframe, let us process the data for the following

- drop duplicates
- replace missing values
- check for outliers

You could analyse the data more to further clean up the data.

#### **Drop duplicates**

Print the number of unique values for each column before checking for duplicates. These operations are fairly time-consuming when run on the complete data.

In [None]:
ticket_price_df.nunique()

In [None]:
#save duplicates for analysis
duplicates_df = ticket_price_df[ticket_price_df.duplicated(['origin','destination','departure','arrival','duration','vehicle_type','vehicle_class','price','fare'])]
len(duplicates_df)

We found 38,343,380 duplicate entries.
>This shows how many duplicate entries a dataset can contain, and why `Data Cleaning` matters.

In [None]:
#group the duplicated rows by all selected columns and count the size of each group
dup_cols = ['origin','destination','departure','arrival','duration','vehicle_type','vehicle_class','price','fare']
grp_df = (duplicates_df.groupby(dup_cols).size()
                      .reset_index(name='group_count')
                      .sort_values(by= 'group_count',ascending= False))
#Look at the highest number of duplicates.
grp_df.head(5)

In [None]:
#drop duplicate rows (keep=False removes every copy of a duplicated row, including its first occurrence)
ticket_price_df.drop_duplicates(keep=False, inplace= True)

#check for missing data
missing_data_pct = ticket_price_df.isna().sum().sort_values(ascending=False)/ticket_price_df.shape[0]
missing_data_pct*100
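The `keep` argument of `drop_duplicates` decides which copies of a repeated row survive, and the choice has a large effect when most rows are duplicated. The toy frame below is made up purely to show the difference between `keep='first'` and `keep=False`.

```python
import pandas as pd

# Toy data: one row repeated three times, one unique row
toy_df = pd.DataFrame({
    "origin": ["MADRID", "MADRID", "MADRID", "SEVILLA"],
    "price": [50.0, 50.0, 50.0, 30.0],
})

kept_first = toy_df.drop_duplicates(keep="first")  # one copy of each row survives
kept_none = toy_df.drop_duplicates(keep=False)     # repeated rows vanish entirely

print(len(kept_first), len(kept_none))
```

With `keep='first'` every distinct trip stays in the data once, whereas `keep=False` leaves only the trips that were never repeated.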

In [None]:
#check if we still have any duplicates
ticket_price_df.duplicated().any()

In [None]:
ticket_price_df.shape

#### **Check for outliers**

In [None]:
pd.options.display.float_format = "{:.2f}".format
ticket_price_df.describe()

By looking at the percentile values we can see that there are outliers in the `duration` and `price` columns.
We will inspect these outliers and handle them.

##### Handling outliers of `duration` column using `Winsorize Method`

In [None]:
# Boxplot for duration column to spot the outliers

matplotlib.rcParams['figure.figsize'] = (8, 4)
sns.boxplot(x = ticket_price_df['duration'])
plt.title('Total Duration of trip')
plt.xlabel('Total time (Hrs)')

In [None]:
print(ticket_price_df.duration.quantile(0.01))
print(ticket_price_df.duration.quantile(0.98))

In [None]:
ticket_price_df['duration'] = winsorize(ticket_price_df['duration'], (0.01, 0.02))
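To make the effect of `winsorize` concrete, here is a small sketch on a made-up array. It uses larger limits than the `(0.01, 0.02)` applied to `duration`, so that the change is visible on ten values.

```python
import numpy as np
from scipy.stats.mstats import winsorize

values = np.array([10, 4, 9, 8, 5, 3, 7, 2, 1, 6])

# Replace the lowest 10% and the highest 20% of values
# with the nearest value that remains inside the limits
clipped = winsorize(values, limits=(0.1, 0.2))

print(clipped.tolist())  # [8, 4, 8, 8, 5, 3, 7, 2, 2, 6]
```

Unlike dropping rows, winsorizing keeps the number of observations unchanged; only the extreme values are pulled in.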

In [None]:
# Boxplot for duration column after winsorizing, to confirm the outliers are handled

matplotlib.rcParams['figure.figsize'] = (8, 4)
sns.boxplot(x = ticket_price_df['duration'])
plt.title('Total Duration of trip')
plt.xlabel('Total time (Hrs)')

##### Handling outliers of `price` column using `Q3 - Q1` method

In [None]:
# Boxplot for Price column to spot the outliers

matplotlib.rcParams['figure.figsize'] = (8, 4)
sns.boxplot(x = ticket_price_df['price'])
plt.title('Total Cost of trip')
plt.xlabel('Total Price (Euro)')

In [None]:
#IQR = Q3 - Q1 = 73.85 - 37.80 = 36.05; margin = IQR / 2 = 18.025
ticket_price_df.drop(ticket_price_df[ticket_price_df.price < (37.80-18.025)].index, inplace=True)
ticket_price_df.drop(ticket_price_df[ticket_price_df.price > (73.85+18.025)].index, inplace=True)
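The quartiles in the cell above are typed in by hand, so they go stale whenever the data changes (for example when switching between the sample and the full file). A possible sketch that derives the bounds from the data is shown below. The helper `iqr_bounds` is hypothetical, and `k=0.5` mirrors the half-IQR margin used above rather than the more common `1.5`.

```python
import pandas as pd


def iqr_bounds(series, k=0.5):
    """Hypothetical helper: return (Q1 - k*IQR, Q3 + k*IQR) for a numeric Series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy prices with one extreme value and one missing value
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0, None])
lower, upper = iqr_bounds(prices, k=0.5)

# Keep in-range values and missing values (those are filled in a later step)
filtered = prices[prices.between(lower, upper) | prices.isna()]

print(lower, upper, len(filtered))
```

Missing prices are kept on purpose here, matching the cell above, where comparisons against `NaN` are false and so those rows are not dropped.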

In [None]:
# Boxplot for Price column after removing values outside the IQR-based range

matplotlib.rcParams['figure.figsize'] = (8, 4)
sns.boxplot(x = ticket_price_df['price'])
plt.title('Total Cost of trip')
plt.xlabel('Total Price (Euro)')

In [None]:
ticket_price_df.describe()

In [None]:
ticket_price_df.shape

#### **Find and replace missing values**

In [None]:
ticket_price_df.isnull().sum()

In [None]:
ticket_price_df['price'].mean()

In [None]:
# ticket_price_df.to_csv('price_null.csv', index = False)

In [None]:
# ticket_price_df['price'].fillna(ticket_price_df['price'].mean(), inplace = True)
# price_null_df = pd.read_csv('price_null.csv')
ticket_price_df['price'] = ticket_price_df['price'].interpolate()
ticket_price_df
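The difference between `interpolate()` and filling with the mean can be seen on a short made-up series. Note that linear interpolation depends on row order, so it assumes neighbouring rows are related.

```python
import numpy as np
import pandas as pd

gappy = pd.Series([10.0, np.nan, 30.0, np.nan, np.nan, 60.0])

# Linear interpolation: each gap is filled from its neighbours
interpolated = gappy.interpolate()

# Mean fill: every gap gets the same value
mean_filled = gappy.fillna(gappy.mean())

print(interpolated.tolist())
print(mean_filled.round(2).tolist())
```

Interpolation follows the local trend of the neighbouring rows, while the mean fill pulls every missing value towards the centre of the distribution.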

In [None]:
ticket_price_df.isnull().sum()

In [None]:
ticket_price_df['vehicle_class'] = ticket_price_df['vehicle_class'].fillna(ticket_price_df['vehicle_class'].mode()[0])
ticket_price_df['fare'] = ticket_price_df['fare'].fillna(ticket_price_df['fare'].mode()[0])

In [None]:
ticket_price_df.isnull().sum()

In [None]:
ticket_price_df.info()

# Graphs

In [None]:
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9,5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

## Section 1

#### Average Cost for travelling by different types of Train.

In [None]:
vehicle_type_list = ticket_price_df['vehicle_type'].value_counts().index.tolist()
vehicle_type_dict = {'vehicle_type':[],
                     'price':[]}
for vehicle_type in vehicle_type_list:
    average_ticket_price = (ticket_price_df.loc[ticket_price_df['vehicle_type'] == vehicle_type, 'price'].mean())
    vehicle_type_dict['vehicle_type'].append(vehicle_type)
    vehicle_type_dict['price'].append(average_ticket_price)

avg_price_df = pd.DataFrame.from_dict(vehicle_type_dict)
# avg_price_df
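The same per-type average can also be computed with a single `groupby`, which avoids filtering the full frame once per train type. The sketch below uses a made-up frame; the variable name `avg_price_by_type` is only for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "vehicle_type": ["AVE", "AVE", "ALVIA", "REGIONAL"],
    "price": [80.0, 60.0, 50.0, 20.0],
})

# One pass over the data: mean price per train type, most expensive first
avg_price_by_type = (
    toy_df.groupby("vehicle_type", as_index=False)["price"]
    .mean()
    .sort_values("price", ascending=False)
    .reset_index(drop=True)
)

print(avg_price_by_type)
```

Sorting by price is a choice made here for readability of the bar chart; the loop above orders the types by frequency instead.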

In [None]:
plt.figure(figsize=(16,6))
plt.xticks(rotation = 75)
plt.title('Cost of travelling by different Train Types')
sns.barplot(x=avg_price_df.vehicle_type, y = avg_price_df.price)
plt.xlabel('Type of Train')
plt.ylabel('Cost of travelling')

Here we can see that the `Cost of travelling` by `AVE` lies in the top 10%.

### Number of trips by train type

In [None]:
vehicle_type = ticket_price_df['vehicle_type'].value_counts()
# vehicle_type

In [None]:
plt.figure(figsize=(18,5))
sns.barplot(x=vehicle_type.index, y = vehicle_type)
plt.ylabel('No. of Trips')
plt.xticks(rotation = 75)
plt.xlabel('Type of Train')
plt.title('No. of Passengers Travelling by (Train Types)')

It seems that a disproportionately high number of passengers travel on the `Alta Velocidad Española (AVE)`, which travels at speeds of up to `310 km/h`. Even though travelling by `AVE` is quite expensive (it lies in the `top 10%`), we can clearly see that most passengers prefer the high-speed train because it takes less time to reach the destination.

## Section 2

> Adding `month` column to see trends over month

In [None]:
ticket_price_df['month']=ticket_price_df['departure'].dt.month_name()
ticket_price_df['month_no']=ticket_price_df['departure'].dt.month
ticket_price_df.sample(5)

###  Average duration for which train runs, over different months

In [None]:
month_duration_df = ticket_price_df[['month_no','month','duration']].groupby(['month_no','month']).mean()
month_duration_df.reset_index(inplace = True)
matplotlib.rcParams['figure.figsize'] = (16,6)
plt.xlabel('Months')
plt.ylabel('Average Travelling Time (Hrs)')
plt.title('Average Duration of Travelling over different Months')
plt.xticks(rotation = 45)
sns.lineplot(data = month_duration_df, x="month", y = 'duration',color = 'green')


Here we can clearly see that the average running time of trains is quite high in `April` and `May` and very low in `February`, `June` and `September`, so we can `reduce` the number of trains when the average running time is very low, i.e. during `February` and `October`.

### Total Earning from Train, throughout the year

In [None]:
month_price_df = ticket_price_df[['month_no','month','price']].groupby(['month_no','month']).sum()
month_price_df.reset_index(inplace = True)
matplotlib.rcParams['figure.figsize'] = (16,6)
plt.xlabel('Months')
plt.ylabel('Total Ticket Price (Euro)')
plt.title('Total Earning from Railways throughout the year')
plt.xticks(rotation = 45)
sns.lineplot(data = month_price_df, x="month", y = 'price',color = 'orange')


The total earnings of the Government from railways in Spain are high in `March` and very low in `June`, `September`, `December` and `January`.

## Section 3

### Creating geographical co-ordinates data file

In [None]:
temp_df = pd.DataFrame(ticket_price_df['destination'].unique())
temp_df.columns = ['city']
location_df = pd.read_csv('es.csv', encoding= 'unicode_escape')
geo_data_df = pd.merge(temp_df, location_df, how='outer', on = 'city')
geo_data_df.sample(5)

### Number of trips started from each city

In [None]:
df = pd.DataFrame(ticket_price_df['origin'].value_counts())
df.reset_index(inplace = True)
df.columns = ['city','count']
geo_data_origin_df = pd.merge(df,geo_data_df, how='left', on='city')
# geo_data_origin_df.sample(5)
geo_data_origin_df.sort_values(by='count', ascending=False).head(5)

In [None]:
df = geo_data_origin_df
m = folium.Map(location = [40.4637, 1], tiles ='OpenStreetMap',    
    zoom_start=6)
for i, row in df.iterrows():
    lat = df.at[i, 'lat']
    lng = df.at[i, 'lng']
    
    iframe = str('''<u><h4> City :</h4></u>'''+ df.at[i, 'city']  +''' <br> <u><h4> No. of Trips Started :</h4></u>''' + str(df.iloc[i]['count'])) 
    popup = folium.Popup(iframe,
                     max_width=200)
    folium.Marker(location = [lat, lng], popup= popup,fill_color='#43d9de',icon=plugins.BeautifyIcon(
                         icon="arrow-down", icon_shape="marker",
                         number=str(df.iloc[i]['count']),
                         border_color= 'grey',
                         background_color= '#43d9de')).add_to(m)
# m.save('trip_origin.html')
m
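Because the counts are joined to the coordinates with a left merge, any city missing from the coordinates file ends up with empty `lat`/`lng` values, and, as far as I know, `folium` rejects marker locations that contain `NaN`. A small pandas-only sketch for spotting and skipping such rows is shown below; the frames and the `trips` column name are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the trip counts and the coordinates file
trip_counts = pd.DataFrame({
    "city": ["MADRID", "SEVILLA", "PONFERRADA"],
    "trips": [100, 40, 5],
})
coords = pd.DataFrame({
    "city": ["MADRID", "SEVILLA"],
    "lat": [40.4168, 37.3891],
    "lng": [-3.7038, -5.9845],
})

# indicator=True adds a `_merge` column saying where each row came from
merged = trip_counts.merge(coords, how="left", on="city", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "city"].tolist()

# Only rows with complete coordinates are safe to place on a map
mappable = merged.drop(columns="_merge").dropna(subset=["lat", "lng"])
for row in mappable.itertuples(index=False):
    print(row.city, row.lat, row.lng, row.trips)

print("No coordinates for:", unmatched)
```

Listing the unmatched cities also helps catch spelling or encoding mismatches between the two files.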

Here we can see that the maximum number of trips started from `Madrid`, which lies in the center.

### Number of trips ending in each city

In [None]:
df = pd.DataFrame(ticket_price_df['destination'].value_counts())
df.reset_index(inplace = True)
df.columns = ['city','count']
geo_data__dest_df = pd.merge(df,geo_data_df, how='left', on='city')
# geo_data__dest_df.sample(5)
geo_data__dest_df.sort_values(by='count', ascending=False).head(5)

In [None]:
df = geo_data__dest_df
m = folium.Map(location = [40.4637, 1], tiles ='OpenStreetMap',    
    zoom_start=6)
for i, row in df.iterrows():
    lat = df.at[i, 'lat']
    lng = df.at[i, 'lng']
    
    iframe = str('''<u><h4> City :</h4></u>'''+ df.at[i, 'city']  +''' <br> <u><h4> Destination of Trips :</h4></u>''' + str(df.iloc[i]['count'])) 
    popup = folium.Popup(iframe,
                     max_width=200)
    folium.Marker(location = [lat, lng], popup= popup,fill_color='#43d9de',icon=plugins.BeautifyIcon(
                         icon="arrow-down", icon_shape="marker",
                         number=str(df.iloc[i]['count']),
                         border_color= 'grey',
                         background_color= '#43d9de')).add_to(m)
# m.save('trip_destination.html')
m

Here we can see that the maximum number of trips ended in `Madrid`, which lies in the center and is connected to all the cities.

## Section 4

#### Creating two base categories `Standard Tourist Class` & `First Class` for vehicle class.

In [None]:
ticket_price_df = ticket_price_df.replace({'vehicle_class':{'Turista':'Tourist standard class','Preferente':'First class','Turista con enlace':'Tourist standard class',
                                                            'Turista Plus':'Tourist standard class','TuristaSólo plaza H':'Tourist standard class',
                                                            'Turista Plus - Turista': 'Tourist standard class', 'PreferenteSólo plaza H':'First class'
                                         }})

> Creating `Buckets` for duration of trip

In [None]:
ticket_price_df['duration_b'] = pd.cut(ticket_price_df['duration'], 8)
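`pd.cut` with an integer splits the observed range into equal-width intervals, so the bucket edges are rarely round numbers. When readable edges matter, explicit bins and labels are an option. The sketch below uses made-up durations and bin edges to show both forms.

```python
import pandas as pd

durations = pd.Series([0.5, 1.5, 2.9, 4.0, 7.5])

# Equal-width buckets: edges are derived from the min and max of the data
equal_width = pd.cut(durations, 4)

# Explicit buckets: edges and labels chosen by hand (illustrative values)
explicit_bins = pd.cut(
    durations,
    bins=[0, 1, 2, 3, 5, 8],
    labels=["0-1 h", "1-2 h", "2-3 h", "3-5 h", "5-8 h"],
)

print(equal_width.cat.categories)
print(explicit_bins.tolist())
```

Explicit labels also make the x-axis of the strip plots easier to read than interval notation.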

In [None]:
matplotlib.rcParams['figure.figsize'] = (20,10)
tourist_class_df1 = ticket_price_df[ticket_price_df['vehicle_class'] == 'Tourist standard class']
sns.stripplot(data=tourist_class_df1, x="duration_b", y="price")
plt.xlabel('Duration (Hrs)')
plt.ylabel('Price (Euro)')
plt.title('Duration Vs Price (Standard Class)')

People travelling in `Tourist Standard Class` have a travelling time between `1 to 7 hrs`, and across different train types the price for the same duration varies a lot, i.e. from `€20` to `€90`.

In [None]:
matplotlib.rcParams['figure.figsize'] = (20,10)
first_class_df = ticket_price_df[ticket_price_df['vehicle_class'] == 'First class']
sns.stripplot(data=first_class_df, x="duration_b", y="price")
plt.xlabel('Duration (Hrs)')
plt.ylabel('Price (Euro)')
plt.title('Duration Vs Price (First Class)')

In `First class` very few people travel for longer than `3.3 hrs`; most people travel for `1 hrs` to `2.5 hrs` or `2.5 hrs` to `3.3 hrs`.

In [None]:
ticket_price_df['weekday'] = ticket_price_df['departure'].dt.day_name()
ticket_price_df['weekday_no'] = ticket_price_df['departure'].dt.dayofweek

In [None]:
ticket_price_df['hour'] = ticket_price_df['departure'].dt.hour

In [None]:
ticket_price_df.sample(3)

In [None]:
trains = (ticket_price_df[['duration','vehicle_type','price']]).round(0)
# trains = trains[trains['price'].notna()]
df_heatmap = trains.pivot_table(values='price',index='vehicle_type',columns='duration')
sns.heatmap(df_heatmap, linewidths = .5, cmap="YlGnBu")

We can see here that the price of some trains is quite high even for shorter durations, and that is because of the train's features
> but the most expensive train is `AVE-TGV`, which has a maximum trip time of `4 hrs`
> and for a duration of `7 hrs` the cheapest train is `AVE-LD`

In [None]:
ticket_price_df['duration'].astype(str)


#### Most number of trips between...

In [None]:
ticket_price_df['origin_destination'] = ticket_price_df['origin'] + '_' + ticket_price_df['destination']
brand_display_text= ticket_price_df.origin_destination.str.cat(sep=" ")
matplotlib.rcParams['figure.figsize'] = (15, 8)
# Create the wordcloud object
wordcloud = WordCloud(width=800, height=600, margin=0, background_color='white').generate(brand_display_text)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()
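A word cloud gives a quick visual impression, but exact counts per route are easier to read from `value_counts`. The sketch below uses a made-up frame with the same `origin` and `destination` columns.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "origin": ["MADRID", "MADRID", "SEVILLA", "MADRID"],
    "destination": ["BARCELONA", "BARCELONA", "MADRID", "VALENCIA"],
})

# Build a route key and count how often each route occurs
routes = toy_df["origin"] + "_" + toy_df["destination"]
route_counts = routes.value_counts()

print(route_counts.head(3))
```

The resulting Series is sorted by frequency, so its first entries are the busiest routes.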

## **Summary: Spain Rail Ticket Pricing** 

We analysed the behaviour of Railway ticket pricing using Python, Pandas, Matplotlib and Seaborn. Here is a summary of the key insights for the Dataset.

Key Metrics:

- Even though travelling by `AVE` is quite expensive, people travel by it a lot, so we can increase the number of `AVE` trains.

- Earnings are highest in `March` and very low in `January`, `June`, `September` and `December`, so we can decrease the number of trains in the months where earnings are low and increase it in `March`.

- Very few people travel for more than `3.3 hrs` in First class, so we can reduce First class capacity and increase Standard class capacity on long train journeys.

We also discovered the following insights from our exploratory data analysis
- For long-distance travel the `AVE-LD` train is pocket friendly (low in cost), and many passengers travel by it.

## References
- Jovian tutorials
  - [Analyzing Tabular Data with Pandas](https://jovian.ai/aakashns/python-pandas-data-analysis)
  - [Data Visualization using Python, Matplotlib and Seaborn](https://jovian.ai/aakashns/python-matplotlib-data-visualization)
  - [Advanced Data Analysis Techniques with Python & Pandas](https://jovian.ai/aakashns/advanced-data-analysis-pandas)
  - [Exploratory Data Analysis Case Study - Stack Overflow Developer Survey](https://jovian.ai/aakashns/stackoverflow-survey-exploratory-data-analysis)
  -  [Interactive Visualization with Plotly](https://jovian.ai/aakashns/interactive-visualization-plotly)
- [10 Key Metrics You MUST Know When Working with Web Data](https://www.youtube.com/watch?v=ZO-YwkVk8Vo) by Eric Sims
- EDA - [Kaggle code](https://www.kaggle.com/datasets/thegurusteam/spanish-high-speed-rail-system-ticket-pricing) by THE GURUS
- [Stackoverflow](https://stackoverflow.com/) hacks, links throughout the notebook
- [Geeks for Geeks](https://www.geeksforgeeks.org/)
- [Medium](https://python.plainenglish.io/using-folium-to-map-latitude-and-longitude-491f8dcc81ad/)
- [Seaborn](https://seaborn.pydata.org/)
- [Pandas](https://pandas.pydata.org/docs/)
- [Handling Missing Values](https://towardsdatascience.com/data-cleaning-how-to-handle-missing-values-in-pandas-cc8570c446ec/)

In [None]:
#Save our work
jovian.commit()