# Exploratory Data Analysis of Past and Future Solar Eclipses 



 Data Source - https://data.world/nasa/five-millennium-catalog-of-solar-eclipses-detailed/workspace/intro

Analysis and Hypothesis: I chose this dataset because I enjoy reading about and exploring stars, planets, and the solar system as a whole. From the dataset, I would like to examine the following hypotheses.

1. Number of Eclipses - A graph displaying the total number of eclipses in each time range covered by the dataset.

2. Frequency - How many eclipses occurred in a particular year, and a comparison of the frequencies across 500-year brackets.

3. Location and Types - Typical location of eclipses given their type. Prediction of the next eclipse.

4. Duration - The mean duration of solar eclipses.
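
As a rough sketch of how hypotheses 1 and 2 could be checked, the snippet below counts eclipses per year and per 500-year bracket. The rows are made up for illustration, and the `Calendar Year` column name is assumed to match the CSV header.

```python
import pandas as pd

# Made-up rows for illustration only; the real catalog is assumed to have a 'Calendar Year' column.
catalog = pd.DataFrame({"Calendar Year": [1901, 1901, 1999, 2017, 2017, 2024, 2600]})

# Number of eclipses in each year.
per_year = catalog["Calendar Year"].value_counts().sort_index()

# Number of eclipses in each 500-year bracket: [1500, 2000), [2000, 2500), [2500, 3000).
edges = list(range(1500, 3001, 500))
brackets = pd.cut(catalog["Calendar Year"], bins=edges, right=False)
per_bracket = brackets.value_counts().sort_index()

print(per_year)
print(per_bracket)
```

`pd.cut` with `right=False` puts the year 2000 into the 2000-2500 bracket, so each bracket starts on its round year.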


Column     Heading     Definition/Description
          
   1       Catalog     Sequential number of the eclipse in the catalog links to
           Number      the map published in the 
                       Five Millennium Canon of Solar Eclipses: -1999 to +3000. 

   2      Calendar     Calendar Date at instant of Greatest Eclipse. 
            Date       Gregorian Calendar is used for dates after 1582 Oct 15.  
                       Julian Calendar is used for dates before 1582 Oct 04.   

   3         TD of     Dynamical Time (TD) of Greatest Eclipse, the instant 
           Greatest    when the axis of the Moon's shadow cone passes closest
           Eclipse     to Earth's center.

   4         ΔT        Delta T (ΔT) is the arithmetic difference between 
                       Dynamical Time and Universal Time. It is a measure of 
                       the accumulated clock error due to the variable 
                       rotation period of Earth.

   5        Luna       Lunation Number is the number of synodic months since 
             Num       New Moon of 2000 Jan 06. The Brown Lunation Number 
                       can be determined by adding 953.

   6        Saros      Saros series number of eclipse.
             Num       (Each eclipse in a Saros is separated by an interval
                        of 18 years 11.3 days.)

   7        Ecl.       Eclipse Type where:
            Type         P  = Partial Eclipse.
                         A  = Annular Eclipse.
                         T  = Total Eclipse.
                         H  = Hybrid or Annular/Total Eclipse.
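
The column definitions above imply a few simple calculations, sketched here with the standard library only. The function names are hypothetical, the Saros length of 6585.32 days is the approximate equivalent of 18 years 11.3 days, and the example times and the ΔT value are approximate illustrations rather than catalog values.

```python
from datetime import datetime, timedelta

SAROS_DAYS = 6585.32   # approximate Saros interval (about 18 years 11.3 days)
BROWN_OFFSET = 953     # offset given in the Luna Num column description


def brown_lunation(lunation_num: int) -> int:
    """Brown Lunation Number = Lunation Number + 953."""
    return lunation_num + BROWN_OFFSET


def universal_time(td: datetime, delta_t_seconds: float) -> datetime:
    """ΔT is defined as TD - UT, so UT = TD - ΔT."""
    return td - timedelta(seconds=delta_t_seconds)


def next_in_saros(td: datetime) -> datetime:
    """Approximate instant of the next eclipse in the same Saros series."""
    return td + timedelta(days=SAROS_DAYS)


# The 1999 Aug 11 and 2017 Aug 21 eclipses both belong to Saros 145.
# The time of day used here is approximate.
eclipse_1999 = datetime(1999, 8, 11, 11, 4)
predicted = next_in_saros(eclipse_1999)
print(predicted.date())

# Illustrative ΔT of 64 seconds.
print(universal_time(datetime(2000, 1, 1, 12, 0, 0), 64))
```

This only predicts the date of the next eclipse in a series; the location shifts between successive Saros members and is not estimated here.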

References - 
https://towardsdatascience.com/geopandas-101-plot-any-data-with-a-latitude-and-longitude-on-a-map-98e01944b972
https://stackoverflow.com/questions/48042915/sort-a-pandas-dataframe-series-by-month-name
https://plotly.com/python/choropleth-maps/
https://gis.stackexchange.com/questions/353724/error-when-converting-a-pandas-dataframe-to-a-geodataframe
https://www.kaggle.com/yashgpt/choropleth-maps-geographic-visualization
https://stackoverflow.com/questions/45574099/plot-different-columns-of-different-dataframe-in-the-same-plot-with-pandas

In [None]:
#!pip install chart-studio


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#import descartes
from chart_studio import plotly
import plotly.express as px
import geopandas as gpd
from geopandas import GeoDataFrame
from pathlib import Path
from datetime import date
from pandas.tseries.offsets import MonthEnd
import plotly.graph_objs as go 
from plotly.offline import init_notebook_mode,iplot
init_notebook_mode(connected=True)

In [None]:
#1901 - 2000 
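url1 = 'https://drive.google.com/file/d/1dBnA5O9c0Nf1myhbEhlYSYLjaXcqiB7p/view?usp=sharing'
# build a direct-download link from the file id (second-to-last segment of the share URL)
path1 = 'https://drive.google.com/uc?export=download&id=' + url1.split('/')[-2]
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
df1 = pd.read_csv(path1, on_bad_lines='skip')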
df1.head()
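
Since the same share-link conversion is needed for each century's file, it can be wrapped in a small helper. This is a sketch: `drive_download_url` is a hypothetical name, and the `uc?export=download` endpoint is an assumption about how Google Drive serves publicly shared files.

```python
def drive_download_url(share_url: str) -> str:
    """Turn a Google Drive 'file/d/<id>/view' share link into a direct-download link."""
    # The file id is the path segment just before 'view?usp=sharing'.
    file_id = share_url.split("/")[-2]
    # Assumed endpoint; the file must be shared publicly for this to work.
    return "https://drive.google.com/uc?export=download&id=" + file_id


share = "https://drive.google.com/file/d/1dBnA5O9c0Nf1myhbEhlYSYLjaXcqiB7p/view?usp=sharing"
direct = drive_download_url(share)
print(direct)
```

The returned string can be passed straight to `pd.read_csv`.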

In [None]:
#dropping unwanted columns
df2=df1.drop(columns=['ΔT s','Unnamed: 18','Gamma'])
df2.head()

In [None]:
#imputing the central duration values using mean
central_duration_mean = df2['Central Dur.'].mean()


In [None]:
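# assign the result back; calling fillna(inplace=True) on a selected column may not update df2
df2['Central Dur.'] = df2['Central Dur.'].fillna(central_duration_mean)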
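
Taking the mean only works when `Central Dur.` is numeric. If the CSV stores durations as text such as `02m09s` (the style NASA's catalog uses), they would need converting first. The sketch below assumes that text format; `duration_to_seconds` is a hypothetical helper and the sample values are made up.

```python
import re

import pandas as pd


def duration_to_seconds(value):
    """Convert text such as '02m09s' to seconds; return NaN for anything else."""
    match = re.fullmatch(r"\s*(\d+)m(\d+)s\s*", str(value))
    if match is None:
        return float("nan")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


# Made-up sample: two valid durations, a dash placeholder, and a missing value.
durations = pd.Series(["02m09s", "-", None, "06m29s"])
seconds = durations.map(duration_to_seconds)

# Mean imputation on the numeric values.
filled = seconds.fillna(seconds.mean())
print(filled.tolist())
```

Partial eclipses have no central line, so a mean-imputed duration for them is a modelling choice rather than an observed value.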

In [None]:
df2.tail()

In [None]:
#imputing the saros num values using mean 
saros_mean = df2['Saros Num'].mean()

In [None]:
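# assign the result back; calling fillna(inplace=True) on a selected column may not update df2
df2['Saros Num'] = df2['Saros Num'].fillna(saros_mean)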

In [None]:
df2.head()

In [None]:
df3 = df2.rename(columns={'Calendar Month': 'CalendarMonth'})
df3.head()

In [None]:
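#sorting the Calendar Month Column
# month labels in calendar order; assumes full English month names, adjust if the CSV uses abbreviations
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
df3['CalendarMonth'] = pd.Categorical(df3['CalendarMonth'], categories=month_order, ordered=True)
df4 = df3.sort_values('CalendarMonth')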
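
A short sketch of how an ordered categorical sorts month names, and how to spot labels that do not match the category list (they silently become NaN). The sample labels are made up, and `calendar.month_name` is assumed to return English names, which holds in the default locale.

```python
import calendar

import pandas as pd

# Made-up labels; 'Jan' is deliberately not a full month name.
months = pd.Series(["March", "January", "December", "Jan"])
month_order = list(calendar.month_name)[1:]   # 'January' ... 'December'

ordered = pd.Categorical(months, categories=month_order, ordered=True)

# Labels missing from month_order turn into NaN, so list them before sorting.
unmatched = months[pd.isna(ordered)]
sorted_months = pd.Series(ordered).sort_values()

print(unmatched.tolist())
print(sorted_months.tolist())
```

Checking `unmatched` first avoids losing rows when the CSV uses abbreviations or different capitalisation.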

In [None]:
df4.plot(subplots=True, figsize=(6, 6));

In [None]:
#types of eclipses as per month and year 
# df3 uses the renamed 'CalendarMonth' column
p1 = sns.swarmplot(x="CalendarMonth", y="Calendar Year", hue="QLE", data=df3)
p1.legend_.remove()
plt.legend(loc="upper left", bbox_to_anchor=(1,1))
plt.show()

Conclusion - Partial eclipses account for the majority of occurrences, with a few occurrences falling outside the specified criteria.
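
The share of each eclipse type can also be checked numerically rather than read off the plot. A minimal sketch, using made-up labels in place of the real `QLE` values:

```python
import pandas as pd

# Made-up labels; the actual values stored in the QLE column may differ.
types = pd.DataFrame({"QLE": ["Partial", "Partial", "Total", "Annular", "Partial", "Hybrid"]})

# Fraction of eclipses of each type, largest first.
share = types["QLE"].value_counts(normalize=True)
most_common = share.idxmax()

print(share)
print(most_common)
```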

In [None]:
#Plotting the duration using KDE plot
plt.figure(figsize=(8,6))
sns.kdeplot(
   data=df3, x="Central Dur.", hue="QLE",
   fill=True, common_norm=False, palette="crest",
   alpha=.5, linewidth=0,
)


Conclusion - Comparing central duration across the QLE categories, the Partial Eclipse dominates the distribution, followed by the Total Eclipse.

In [None]:
#downloaded the countries.geojson file
# points_from_xy expects x = longitude and y = latitude
geometry = gpd.points_from_xy(df3['Long °'].astype('float32'), df3['Lat °'].astype('float32'))
gdf = GeoDataFrame(df3, geometry=geometry, crs='EPSG:4326')

#plot the eclipse points on top of the downloaded country outlines
world = gpd.read_file('countries.geojson')
gdf.plot(ax=world.plot(figsize=(10, 6)), marker='o', color='red', markersize=15);
plt.show()
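
Converting the coordinate columns with `astype('float32')` only works if they are already plain numbers. If the CSV keeps hemisphere suffixes (NASA's catalog prints values such as 23.7N or 57.8W), they need converting to signed degrees first. The sketch below assumes that format; `signed_degrees` is a hypothetical helper.

```python
def signed_degrees(coord: str) -> float:
    """Convert '23.7N' to 23.7 and '57.8W' to -57.8 (south and west are negative)."""
    coord = coord.strip()
    value, hemisphere = float(coord[:-1]), coord[-1].upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unexpected hemisphere suffix in {coord!r}")
    return -value if hemisphere in ("S", "W") else value


print(signed_degrees("23.7N"))
print(signed_degrees("57.8W"))
```

The helper could be applied with `Series.map` before building the point geometry.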

In [None]:
import plotly.graph_objects as go


df4 = pd.read_csv('1901-2000.csv')

# The catalog gives point coordinates, not region codes, so the eclipses are
# drawn as markers with Scattergeo rather than as a Choropleth.
fig = go.Figure(data=go.Scattergeo(
    lon = df4['Long °'],
    lat = df4['Lat °'],
    text = df4['Sun Alt °'],
    mode = 'markers',
    marker = dict(
        color = df4['QLE'].astype('category').cat.codes,  # integer code per QLE category
        colorscale = 'Blues',
        reversescale = True,
        line = dict(color='darkgray', width=0.5),
        colorbar = dict(title='QLE'),
    ),
))

fig.update_layout(
    title_text='Solar Eclipses<br>Locations',
    geo=dict(
        showframe=False,
        showcoastlines=True,
        projection_type='equirectangular'
    ),
    annotations = [dict(
        x=0.55,
        y=0.1,
        xref='paper',
        yref='paper',
        text='Source: NASA Five Millennium Catalog of Solar Eclipses',
        showarrow = False
    )]
)

fig.show()


In [None]:
#Specifically for USA region
# a separate name keeps the cleaned df2 from being overwritten
usa_df = pd.read_csv("1901-2000.csv")

# scope="usa" restricts the map to the United States; points are placed by latitude and longitude
fig = px.scatter_geo(usa_df, lat='Lat °', lon='Long °', color='QLE',
                     scope="usa",
                     labels={'QLE':'Type of Eclipses'}
                    )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

In [None]:
import plotly.express as px

# eclipses of a single year, plotted as points because the catalog has no country codes
df_1997 = df3.query("`Calendar Year` == 1997")
fig = px.scatter_geo(df_1997, lat="Lat °", lon="Long °",
                     color="QLE", # QLE is a column of the eclipse catalog
                     hover_name="CalendarMonth", # column to add to hover information
                     color_continuous_scale=px.colors.sequential.Plasma)
fig.show()

In [None]:
#2001 - 2100
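url2 = 'https://drive.google.com/file/d/1d0rfJxQUIIAYIXKYJ7xbcpOOV1SFwdBr/view?usp=sharing'
# build a direct-download link from the file id (second-to-last segment of the share URL)
path2 = 'https://drive.google.com/uc?export=download&id=' + url2.split('/')[-2]
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
df5 = pd.read_csv(path2, on_bad_lines='skip')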
df5.head()

In [None]:
#dropping unwanted columns
df6 = df5.drop(columns=['ΔT s','Gamma','Saros Num','Ecl. Mag.'])


In [None]:
df6.head()

In [None]:
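# '#NAME?' is a spreadsheet export error; treat it as 0 and keep the result
df6 = df6.replace({'#NAME?': 0})

In [None]:
# read_csv stores missing values as real NaN, not the text 'NaN', so fillna is needed
df7 = df6.fillna(0)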
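
The difference between the text `'NaN'` and a real missing value is easy to miss, so here is a small sketch on a made-up frame. `replace({'NaN': 0})` leaves real missing values untouched, while `fillna(0)` fills them.

```python
import numpy as np
import pandas as pd

# Made-up column with one real NaN and one spreadsheet error string.
frame = pd.DataFrame({"Central Dur.": [120.0, np.nan, "#NAME?"]})

# Looks only for the literal text 'NaN', so the real NaN survives.
unchanged = frame.replace({"NaN": 0})

# Turn the error string into NaN, then fill every missing value with 0.
cleaned = frame.replace({"#NAME?": np.nan}).fillna(0)

print(unchanged["Central Dur."].isna().sum())
print(cleaned["Central Dur."].tolist())
```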


In [None]:
df7.head()

In [None]:
plt.figure(figsize=(8,6))
p2 = sns.swarmplot(x="Calendar Month", y="Calendar Year", hue="QLE", data=df7)
p2.legend_.remove()
plt.legend(loc="upper left", bbox_to_anchor=(1,1))
plt.show()

Conclusion - According to this plot, the eclipses between 2020 and 2040 will be total eclipses, and no eclipses occur in April, May, October, or November.
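
A reading like this can be cross-checked directly against the table. The sketch below lists the months without any eclipse in a year window and counts the types within it. The rows are made up, and the column names are assumed to match the CSV.

```python
import calendar

import pandas as pd

# Made-up rows for illustration only.
future = pd.DataFrame({
    "Calendar Year": [2019, 2021, 2024, 2026, 2045],
    "Calendar Month": ["July", "June", "April", "August", "August"],
    "QLE": ["Total", "Annular", "Total", "Total", "Total"],
})

# Keep only eclipses from 2020 to 2040 inclusive.
window = future[future["Calendar Year"].between(2020, 2040)]

# Months of the year that have no eclipse inside the window.
present = set(window["Calendar Month"])
missing_months = [m for m in list(calendar.month_name)[1:] if m not in present]

# How many eclipses of each type fall inside the window.
type_counts = window["QLE"].value_counts()

print(missing_months)
print(type_counts)
```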

In [None]:
#Plotting the duration month wise using KDE plot
plt.figure(figsize=(8,6))
sns.kdeplot(
   data=df2, x="Central Dur.", hue="Calendar Month",
   fill=True, common_norm=False, palette="crest",
   alpha=.5, linewidth=0,
)



In [None]:
df7.plot(subplots=True, figsize=(6, 6));

In [None]:
#plotting monthly data over years
for month in df7['Calendar Month'].unique():
    data = df7[df7['Calendar Month'] == month]  # filter the data for a specific month
    plt.figure()  # create a new figure for each month
    sns.lineplot(data=data, x='Calendar Year', y='Central Dur.', marker='o')
    plt.xlim(2001, 2100)  # year range covered by this file
    plt.title(f'Month: {month}')
    plt.ylabel('Central Dur.')
    plt.xlabel('Year')
    plt.show()

In [None]:
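for k, v in df7.groupby('Calendar Month'):  # group the dataframe by month
    plt.figure(figsize=(10, 20))

    # horizontal bars: central duration for each eclipse year within the month
    # errorbar=None needs seaborn 0.12 or later; older versions use ci=None
    sns.barplot(data=v, x='Central Dur.', y='Calendar Year', errorbar=None, orient='h', hue='QLE')
    plt.title(f'Month: {k}')
    plt.ylabel('Year')
    plt.legend(bbox_to_anchor=(1.04,0.5), loc="center left", borderaxespad=0)
plt.show()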

In [None]:
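# the title refers to 2014, so keep only that year's eclipses
ecl_2014 = df7[df7['Calendar Year'] == 2014]

# eclipse positions are latitude/longitude points, so a scattergeo trace is used
data = dict(
        type = 'scattergeo',
        lon = ecl_2014['Long °'],
        lat = ecl_2014['Lat °'],
        mode = 'markers',
        text = ecl_2014['Central Dur.'],
        marker = dict(color = ecl_2014['QLE'].astype('category').cat.codes,
                      colorscale = 'Viridis',
                      reversescale = True,
                      colorbar = {'title' : 'QLE'}),
      ) 

layout = dict(title = '2014 Solar Eclipse Location',
                geo = dict(showframe = False,projection = {'type':'stereographic'})
             )
choromap = go.Figure(data = [data],layout = layout)
iplot(choromap,validate=False)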


In [None]:
#Comparing the Central Duration and path width in km for both the centuries
# plots the catalog data directly; "Path Width Km" must match the
# path-width column header in the CSV files
past = df2.sort_values("Central Dur.")      # 1901-2000
future = df7.sort_values("Central Dur.")    # 2001-2100

fig, ax = plt.subplots()
ax2 = ax.twinx()

# solid line: 1901-2000 on the left axis; dashed line: 2001-2100 on the right axis
past.plot(x="Central Dur.", y="Path Width Km", ax=ax)
future.plot(x="Central Dur.", y="Path Width Km", ax=ax2, ls="--")

plt.show()

Conclusion - The central duration of future eclipses will be lower, and the path width will also decrease.

In [None]:
#Calculate the mean 
df2['Central Dur.'].mean()


In [None]:
#calculate the mean 
df7['Central Dur.'].mean()

In [None]:
#Central Dur.1 = 1901-2000
#Central Dur.2 = 2001-2100
data = [df2["Central Dur."], df7["Central Dur."]]

headers = ["Central Dur.1", "Central Dur.2"]

df10 = pd.concat(data, axis=1, keys=headers)

print(df10)
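
`pd.concat(..., axis=1)` aligns the two columns on the row index, so the shorter column is padded with NaN. The sketch below, on made-up numbers, shows how filling that padding with 0 changes the column mean.

```python
import pandas as pd

# Made-up durations in seconds; the second series is one row shorter.
past = pd.Series([100.0, 200.0, 300.0], name="Central Dur.")
future = pd.Series([150.0, 250.0], name="Central Dur.")

combined = pd.concat([past, future], axis=1, keys=["Central Dur.1", "Central Dur.2"])

# mean() skips NaN by default, so the padding row is ignored.
mean_skipping_nan = combined.mean()

# After fillna(0) the padding counts as a 0-second eclipse and lowers the mean.
mean_zero_filled = combined.fillna(0).mean()

print(mean_skipping_nan)
print(mean_zero_filled)
```

For comparing the two centuries, the NaN-skipping mean is usually the safer figure.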

In [None]:
# missing values are real NaN, not the text 'NaN', so use fillna
df10.fillna(0)