# Re-analysing Yim, Shao & Xu (2024) and using machine learning to learn more about glitch distributions

This notebook analyses ATNF pulsar data together with JBCA and ATNF glitch data. We will redo the analysis of [Yim, Shao & Xu (2024)](https://academic.oup.com/mnras/article/532/4/3893/7712489) using improved code, which includes:

- Web scraping the data so that the latest updates are always included
- Handling the data using Pandas, which makes the code more readable and faster to execute
- Exploring the pulsar dataset further to find which pulsar features are correlated
- Writing code to determine the glitch size distribution and waiting time distribution
- Using machine learning to determine which features can be used to predict the above distributions

The project will be divided into two main parts:

*PART I* (reproducing [Yim, Shao & Xu (2024)](https://academic.oup.com/mnras/article/532/4/3893/7712489))
- Loading the data (using the self-written Python module, <code>read_catalogues.py</code>)
- Cleaning the data
- Exploring/processing/applying mathematical models to the data
- Visualising the results

*PART II* (applying machine learning to determine glitch size and waiting time distributions)
- Determining each pulsar's actual distribution for glitch sizes and waiting times (creating labels for training and testing)
- Processing the data so it is in a suitable format for applying machine learning models
- Applying different machine learning models
- Evaluating different machine learning models

---

# PART I - Re-analysing Yim, Shao & Xu (2024)

## Importing libraries and modules

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

import read_catalogues # Self-written

---

## Loading the data

Loading glitch data from the JBCA Glitch Catalogue:

In [8]:
df_glitch_JBCA = read_catalogues.read_JBCA_glitch_catalogue()
df_glitch_JBCA

Unnamed: 0,Pulsar name,J-name,No.,MJD,+/-,dF/F,+/-.1,dF1/F1,+/-.2,References
0,J0007+7303,0007+7303,1,54953,X,554,1,1.0,0.1,"Abdo+2012 [awd+12], also in Ray+2011 [rkp+11]"
1,J0007+7303,0007+7303,2,55466,X,1260,X,X,X,Belfore+2011 [3rd Fermi symp.]
2,J0040-7335,0040-7335,1,59919.7,X,1.31,0.18,0.056,0.025,New. Also in Carli+24 [cab+24]
3,J0040-7335,0040-7335,2,60355.8,X,1.9,0.4,0.68,0.11,New
4,J0040-7337,0040-7337,1,60013.13,0.05,1810,X,7,X,Carli+24 [cab+24]
...,...,...,...,...,...,...,...,...,...,...
723,1E_2259+586,2301+5852,5,54880,X,-14,1,-29.3,22.2,Icdem+2012 [ibi12]
724,B2323+63,2325+6316,1,53957,31,0.21,0.02,-0.32,0.04,Basu+2021 [bsa+21]
725,B2334+61,2337+6151,1,53642,13,20470,1,23.8,0.4,"Espinoza+2011 [elsk11], also in Yuan+2010 [ymw..."
726,J2346-0609,2346-0609,1,57495,2,0.55,0.01,2.4,0.4,Basu+2021 [bsa+21]


In [9]:
df_glitch_JBCA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 728 entries, 0 to 727
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Pulsar name  728 non-null    object
 1   J-name       728 non-null    object
 2   No.          728 non-null    object
 3   MJD          728 non-null    object
 4   +/-          728 non-null    object
 5   dF/F         728 non-null    object
 6   +/-          728 non-null    object
 7   dF1/F1       728 non-null    object
 8   +/-          728 non-null    object
 9   References   728 non-null    object
dtypes: object(10)
memory usage: 57.0+ KB


The JBCA Glitch Catalogue had 712 glitch entries across 10 columns on 2025/11/09 and 728 glitch entries on 2025/12/08. We will need to convert certain columns to the correct data type, i.e. changing object (string) to float. Although there are apparently no null entries, we see above that missing values are marked by an 'X'. We will change these into actual null entries shortly.
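Because every cell is read in as a string, `info()` reports no nulls even where a placeholder is present. Below is a minimal sketch of how the placeholders could be counted before and after replacement. The table `toy` and its values are made up for illustration and are not taken from the catalogue.

```python
import pandas as pd

# Hypothetical miniature of the raw table: every value is a string and 'X' marks a missing entry
toy = pd.DataFrame({
    'MJD': ['54953', '55466', '59919.7'],
    '+/-': ['X', 'X', '0.05'],
    'dF/F': ['554', '1260', 'X'],
})

# The null count sees nothing missing, because 'X' is an ordinary string
nulls_before = toy.isna().sum()

# Count the placeholders column by column
placeholder_counts = toy.eq('X').sum()

# After replacing the placeholder, the same entries are reported as null
nulls_after = toy.replace('X', pd.NA).isna().sum()

print(placeholder_counts)
```

Comparing `placeholder_counts` with `nulls_after` gives a quick check that the replacement caught every placeholder.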

Download the ATNF files if they are not in the current working directory:

In [12]:
if not os.path.exists('psrcat_pkg.tar.gz'):
    read_catalogues.download_ATNF_catalogues()

Loading glitch data from the ATNF Glitch Catalogue:

In [14]:
df_glitch_ATNF = read_catalogues.read_ATNF_glitch_catalogue()
df_glitch_ATNF

Unnamed: 0,Name,J2000 Name,Glitch Epoch,+/-,dF_F,+/-.1,dF1_F1,+/-.2,Q,+/-.3,T_d,+/-.4,Ref.
0,J0007+7303,J0007+7303,54952.652,-,553.7,0.6,0.97,0.06,-,-,-,-,awd+12
1,B0144+59,J0147+5922,53682,15,0.056,0.003,-0.21,0.05,-,-,-,-,ywml10
2,B0154+61,J0157+6212,58283,3,2.6,0.3,-,-,-,-,-,-,bsa+22
3,J0146+6145,J0146+6145,51141,248,650,150,14,5,-,-,-,-,mks05
4,J0146+6145,J0146+6145,53809.185840,-,1630,350,5100,1100,1.1,0.3,17.0,1.1,gdk11
...,...,...,...,...,...,...,...,...,...,...,...,...,...
639,J2301+5852,J2301+5852,56125,2,260,50,-2600,200,-,-,-,-,akn+13
640,B2323+63,J2325+6316,53957,31,0.21,0.02,-0.32,0.04,-,-,-,-,bsa+22
641,B2334+61,J2337+6151,53615,6,20579.4,1.2,156,4,0.0046,0.0007,21.4,0.5,ymw+10
642,-,-,-,-,-,-,-,-,0.0029,0.0001,147,2,ymw+10


In [15]:
df_glitch_ATNF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 644 entries, 0 to 643
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Name          644 non-null    object
 1   J2000 Name    644 non-null    object
 2   Glitch Epoch  644 non-null    object
 3   +/-           644 non-null    object
 4   dF_F          644 non-null    object
 5   +/-           644 non-null    object
 6   dF1_F1        644 non-null    object
 7   +/-           644 non-null    object
 8   Q             644 non-null    object
 9   +/-           644 non-null    object
 10  T_d           644 non-null    object
 11  +/-           644 non-null    object
 12  Ref.          644 non-null    object
dtypes: object(13)
memory usage: 65.5+ KB


The ATNF Glitch Catalogue has 644 entries in total across 13 different columns (2025/11/09), and still 644 on 2025/12/08. Some of these entries are for the same glitch but with different recovery parameters, e.g. if a glitch had two recovery timescales. For these multi-exponential recovery glitches, we will treat each recovery independently.

Like before, we will need to convert certain columns to the correct data type, i.e. changing object (string) to float. Although there are apparently no null entries, we see above that missing values are marked by a '-'. We will change these into actual null entries shortly.

Loading pulsar data from the ATNF Pulsar Catalogue:

In [18]:
df_pulsars = read_catalogues.read_ATNF_pulsar_catalogue()
df_pulsars

Unnamed: 0,A1,A12DOT,A1DOT,A1_2,A1_3,ASSOC,BINARY,BINCOMP,CLK,DECJ,...,T0,T0_2,T0_3,TASC,TASC_2,TAU_SC,TYPE,UNITS,W10,W50
0,,,,,,"GRS:4FGL_J0002.8+6217[aab+22],XRS:1XSPS_J00025...",,,,+62:16:09.4,...,,,,,,,HE[wcp+18],,,
1,,,,,,,,,,+18:34:59,...,,,,,,,,,112.1,61.3
2,,,,,,"GRS:4FGL_J0007.0+7303[aab+22],XRS:RX_J0007.0+7...",,,,+73:03:07.4,...,,,,,,,NRAD[aab+22],,,
3,,,,,,,,,,+08:10,...,,,,,,,,,53,13
4,,,,,,,,,TT(BIPM2019),+54:31:40,...,,,,,,,RRAT[dcm+23],,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4346,,,,,,,,,,-22:51:53,...,,,,,,,,,21,9
4347,8.8929760,,,,,,ELL1,He[mzl+23],,+00:51:09.57,...,59258.1366884,,,,,,,TDB,1.7,0.5
4348,,,,,,,,,,04:43,...,,,,,,,,,,
4349,,,,,,,,,TT(BIPM2019),+15:23:19,...,,,,,,,RRAT[dcm+23],,,


In [19]:
df_pulsars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4351 entries, 0 to 4350
Columns: 162 entries, A1 to W50
dtypes: object(162)
memory usage: 5.4+ MB


We see that there are 4351 entries (pulsars) each with up to 162 features (2025/11/09). Still 4351 entries on 2025/12/08. Shortly, we will pull out only the features that are important to us. We will convert some columns into the correct data type too.

---

## Cleaning the data

### JBCA Glitch Catalogue

#### Changing header and data types

In [25]:
# Changing all X's into NaN
df_glitch_JBCA = df_glitch_JBCA.replace(['X', 'x'], pd.NA)

# Changing the column names
headers = ['pulsar_name', 'J_name', 'pulsar_glitch_number', 'MJD', 'MJD_err', 'dF_F', 'dF_F_err', 'dF1_F1', 'dF1_F1_err', 'references']
df_glitch_JBCA.columns = headers

# Changing each column to its correct data type - use Pandas nullable dtypes as they support missing values for every type (NumPy integer columns, for example, cannot hold NaN)
dtype_map = {
    'pulsar_name' : 'string',
    'J_name' : 'string', 
    'pulsar_glitch_number' : 'Int64', 
    'MJD' : 'Float64', 
    'MJD_err' : 'Float64', 
    'dF_F' : 'Float64', 
    'dF_F_err' : 'Float64', 
    'dF1_F1' : 'Float64', 
    'dF1_F1_err' : 'Float64', 
    'references' : 'string'
}
df_glitch_JBCA = df_glitch_JBCA.astype(dtype_map)

df_glitch_JBCA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 728 entries, 0 to 727
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   pulsar_name           728 non-null    string 
 1   J_name                727 non-null    string 
 2   pulsar_glitch_number  728 non-null    Int64  
 3   MJD                   728 non-null    Float64
 4   MJD_err               681 non-null    Float64
 5   dF_F                  726 non-null    Float64
 6   dF_F_err              708 non-null    Float64
 7   dF1_F1                610 non-null    Float64
 8   dF1_F1_err            605 non-null    Float64
 9   references            728 non-null    string 
dtypes: Float64(6), Int64(1), string(3)
memory usage: 62.0 KB


We see that there are some issues. For example, there is a missing J-name, which we will correct. There are also two glitches that do not have a dF_F reading, which we will remove.

#### Finding the null J-name and correcting it

In [28]:
df_glitch_JBCA[df_glitch_JBCA['J_name'].isnull()]

Unnamed: 0,pulsar_name,J_name,pulsar_glitch_number,MJD,MJD_err,dF_F,dF_F_err,dF1_F1,dF1_F1_err,references
567,AX_1838.0-0655,,1,55010.0,4.0,1550.0,70.0,,,Kuiper+2010 [kh10]


In [29]:
df_glitch_JBCA.loc[df_glitch_JBCA['J_name'].isnull(), 'J_name'] = '1838-0655'
df_glitch_JBCA.loc[df_glitch_JBCA['pulsar_name'] == 'AX_1838.0-0655', 'pulsar_name'] = 'J1838-0655'

In [30]:
df_glitch_JBCA.loc[df_glitch_JBCA['J_name'] == '1838-0655', :]

Unnamed: 0,pulsar_name,J_name,pulsar_glitch_number,MJD,MJD_err,dF_F,dF_F_err,dF1_F1,dF1_F1_err,references
567,J1838-0655,1838-0655,1,55010.0,4.0,1550.0,70.0,,,Kuiper+2010 [kh10]


#### Finding the null dF/F values and removing them

In [32]:
df_glitch_JBCA[df_glitch_JBCA['dF_F'].isnull()]

Unnamed: 0,pulsar_name,J_name,pulsar_glitch_number,MJD,MJD_err,dF_F,dF_F_err,dF1_F1,dF1_F1_err,references
307,B1338-62,1341-6220,34,58178.0,15.0,,,,,Lower+2021 [ljd+21]
308,B1338-62,1341-6220,35,58214.0,17.0,,,,,Lower+2021 [ljd+21]


In [33]:
df_glitch_JBCA = df_glitch_JBCA.dropna(subset=['dF_F']).reset_index(drop=True)
df_glitch_JBCA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 726 entries, 0 to 725
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   pulsar_name           726 non-null    string 
 1   J_name                726 non-null    string 
 2   pulsar_glitch_number  726 non-null    Int64  
 3   MJD                   726 non-null    Float64
 4   MJD_err               679 non-null    Float64
 5   dF_F                  726 non-null    Float64
 6   dF_F_err              708 non-null    Float64
 7   dF1_F1                610 non-null    Float64
 8   dF1_F1_err            605 non-null    Float64
 9   references            726 non-null    string 
dtypes: Float64(6), Int64(1), string(3)
memory usage: 61.8 KB


#### Checking for negative dF/F values and removing those rows

In [35]:
df_glitch_JBCA[df_glitch_JBCA['dF_F'] < 0]

Unnamed: 0,pulsar_name,J_name,pulsar_glitch_number,MJD,MJD_err,dF_F,dF_F_err,dF1_F1,dF1_F1_err,references
721,1E_2259+586,2301+5852,5,54880.0,,-14.0,1.0,-29.3,22.2,Icdem+2012 [ibi12]


In [36]:
df_glitch_JBCA = df_glitch_JBCA[df_glitch_JBCA['dF_F'] >= 0].reset_index(drop=True)
df_glitch_JBCA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 725 entries, 0 to 724
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   pulsar_name           725 non-null    string 
 1   J_name                725 non-null    string 
 2   pulsar_glitch_number  725 non-null    Int64  
 3   MJD                   725 non-null    Float64
 4   MJD_err               679 non-null    Float64
 5   dF_F                  725 non-null    Float64
 6   dF_F_err              707 non-null    Float64
 7   dF1_F1                609 non-null    Float64
 8   dF1_F1_err            604 non-null    Float64
 9   references            725 non-null    string 
dtypes: Float64(6), Int64(1), string(3)
memory usage: 61.7 KB


#### Renaming the J-name of J1844+00 to J1844+0034 

In [38]:
# df_glitch_JBCA.loc[df_glitch_JBCA['J_name'] == '1844+00']

In [39]:
# df_glitch_JBCA.loc[df_glitch_JBCA['J_name'] == '1844+00', 'J_name'] = '1844+0034'

In [40]:
# df_glitch_JBCA.loc[df_glitch_JBCA['J_name'] == '1844+0034', :]

#### Adding 'J' to the start of all J-names

In [42]:
does_not_have_J = ~df_glitch_JBCA['J_name'].str.startswith('J') # Creates a Boolean mask that tests whether J_name starts with 'J', the tilde is a NOT operator so exchanges True <--> False  
df_glitch_JBCA.loc[does_not_have_J, 'J_name'] = 'J' + df_glitch_JBCA.loc[does_not_have_J, 'J_name']

df_glitch_JBCA.head()

Unnamed: 0,pulsar_name,J_name,pulsar_glitch_number,MJD,MJD_err,dF_F,dF_F_err,dF1_F1,dF1_F1_err,references
0,J0007+7303,J0007+7303,1,54953.0,,554.0,1.0,1.0,0.1,"Abdo+2012 [awd+12], also in Ray+2011 [rkp+11]"
1,J0007+7303,J0007+7303,2,55466.0,,1260.0,,,,Belfore+2011 [3rd Fermi symp.]
2,J0040-7335,J0040-7335,1,59919.7,,1.31,0.18,0.056,0.025,New. Also in Carli+24 [cab+24]
3,J0040-7335,J0040-7335,2,60355.8,,1.9,0.4,0.68,0.11,New
4,J0040-7337,J0040-7337,1,60013.13,0.05,1810.0,,7.0,,Carli+24 [cab+24]


#### Summary after cleaning

In [44]:
df_glitch_JBCA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 725 entries, 0 to 724
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   pulsar_name           725 non-null    string 
 1   J_name                725 non-null    string 
 2   pulsar_glitch_number  725 non-null    Int64  
 3   MJD                   725 non-null    Float64
 4   MJD_err               679 non-null    Float64
 5   dF_F                  725 non-null    Float64
 6   dF_F_err              707 non-null    Float64
 7   dF1_F1                609 non-null    Float64
 8   dF1_F1_err            604 non-null    Float64
 9   references            725 non-null    string 
dtypes: Float64(6), Int64(1), string(3)
memory usage: 61.7 KB


In [45]:
df_glitch_JBCA['J_name'].nunique()

223

After cleaning, we have 725 glitches from 223 unique pulsars. Note: counting unique MJD values gives only 703, because 22 entries share their MJD with an earlier entry.

In [47]:
df_glitch_JBCA['MJD'].duplicated().value_counts()

MJD
False    703
True      22
Name: count, dtype: int64

In [48]:
df_glitch_JBCA[df_glitch_JBCA['MJD'].duplicated()]

Unnamed: 0,pulsar_name,J_name,pulsar_glitch_number,MJD,MJD_err,dF_F,dF_F_err,dF1_F1,dF1_F1_err,references
220,J1019-5749,J1019-5749,1,55595.0,10.0,1.33,0.4,0.12,0.22,Lower+2021 [ljd+21]
224,J1023-5746,J1023-5746,1,55043.0,8.0,3570.0,1.0,10.62,0.07,Gügercinoğlu+22 [ggyz22]; also in Belfore+2011...
255,J1112-6103,J1112-6103,3,55288.0,7.0,1793.0,1.0,6.0,2.0,Lower+2021 [ljd+21]
327,J1413-6141,J1413-6141,11,56975.0,8.0,30.0,2.0,,,Lower+2021 [ljd+21]
334,J1420-6048,J1420-6048,4,54652.0,20.0,937.1,0.3,2.95,0.01,"Weltevrede+2011 [wje11], also in Yu+2012 [ymh+..."
358,J1617-5055,J1617-5055,6,56267.0,6.0,2068.0,2.0,13.2,0.7,Lower+2021 [ljd+21]
365,B1643-43,J1646-4346,2,55288.0,7.0,8591.0,6.0,16.0,9.0,Lower+2021 [ljd+21]
366,CXO_J164710.2-455216,J1647-4552,1,53999.0,,65000.0,3000.0,,,Israel+2007 [icd+07]
388,1RXS_J1708-4009,J1708-4008,4,53366.0,,572.0,66.0,12.0,8.0,Dib+2008 [dkg08]
405,B1727-47,J1731-4744,5,56975.0,8.0,6.4,0.3,,,Lower+2021 [ljd+21]
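With its default settings, `duplicated()` flags only the second and later occurrences of a value, so the table above lists one member of each group. The sketch below, on made-up data (the J-names and epochs are hypothetical), shows how `keep=False` displays every member, and how adding `J_name` to the subset separates epochs shared by different pulsars from entries repeated for the same pulsar.

```python
import pandas as pd

# Hypothetical entries: two pulsars share one epoch, and one pulsar has a repeated entry
toy = pd.DataFrame({
    'J_name': ['J0001+0001', 'J0002+0002', 'J0003+0003', 'J0003+0003'],
    'MJD': [55000.0, 55000.0, 56000.0, 56000.0],
})

# keep=False flags every row that shares its MJD with another row, not just the later ones
shared_mjd = toy[toy['MJD'].duplicated(keep=False)]

# Using (J_name, MJD) pairs isolates entries repeated for the same pulsar
repeated_entries = toy[toy.duplicated(subset=['J_name', 'MJD'], keep=False)]

print(shared_mjd)
print(repeated_entries)
```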


### ATNF Glitch Catalogue

#### Changing header and data types

In [51]:
# Changing all hyphens and asterisks into NaNs
df_glitch_ATNF = df_glitch_ATNF.replace(['-', '*'], pd.NA)

# Changing the column names
headers = ['pulsar_name', 'J_name', 'MJD', 'MJD_err', 'dF_F', 'dF_F_err', 'dF1_F1', 'dF1_F1_err', 'Q', 'Q_err', 'T_d', 'T_d_err', 'references']
df_glitch_ATNF.columns = headers

# Remove '[s]' (a string) from some MJD dates and their errors (which should be float); regex=False so that '[s]' is matched literally rather than as a character class
df_glitch_ATNF['MJD'] = df_glitch_ATNF['MJD'].str.replace('[s]', '', regex=False)
df_glitch_ATNF['MJD_err'] = df_glitch_ATNF['MJD_err'].str.replace('[s]', '', regex=False)

# Changing each column to its correct data type - this would throw an error if '[s]' were still present in the MJD values
dtype_map_2 = {
    'pulsar_name' : 'string',
    'J_name' : 'string', 
    'MJD' : 'Float64', 
    'MJD_err' : 'Float64', 
    'dF_F' : 'Float64', 
    'dF_F_err' : 'Float64', 
    'dF1_F1' : 'Float64', 
    'dF1_F1_err' : 'Float64', 
    'Q' : 'Float64', 
    'Q_err' : 'Float64', 
    'T_d' : 'Float64', 
    'T_d_err' : 'Float64', 
    'references' : 'string'
}
df_glitch_ATNF = df_glitch_ATNF.astype(dtype_map_2)

df_glitch_ATNF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 644 entries, 0 to 643
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   pulsar_name  626 non-null    string 
 1   J_name       626 non-null    string 
 2   MJD          626 non-null    Float64
 3   MJD_err      581 non-null    Float64
 4   dF_F         624 non-null    Float64
 5   dF_F_err     618 non-null    Float64
 6   dF1_F1       552 non-null    Float64
 7   dF1_F1_err   551 non-null    Float64
 8   Q            136 non-null    Float64
 9   Q_err        134 non-null    Float64
 10  T_d          139 non-null    Float64
 11  T_d_err      137 non-null    Float64
 12  references   644 non-null    string 
dtypes: Float64(10), string(3)
memory usage: 71.8 KB


In terms of missing data, it seems okay, but it is worth checking the errors for Q and T_d, as each has 2 errors missing. Also, there seem to be only 626 pulsar names but 644 references. This is because some entries represent multiple recoveries of a single glitch, where each recovery component has its own row. For example, look at Vela's glitch on MJD 51559.3190:

In [53]:
df_glitch_ATNF.iloc[145:158]

Unnamed: 0,pulsar_name,J_name,MJD,MJD_err,dF_F,dF_F_err,dF1_F1,dF1_F1_err,Q,Q_err,T_d,T_d_err,references
145,B0833-45,J0835-4510,46259.0,2.0,1598.5,1.5,13.7,1.1,0.0037,0.0005,6.5,0.5,mkhr87
146,,,,,,,,,0.1541,0.0006,332.0,10.0,mkhr87
147,B0833-45,J0835-4510,47519.8036,8e-05,1805.2,0.8,77.0,6.0,0.005385,1e-05,4.62,0.02,mhmk90
148,,,,,,,,,0.1684,0.0004,351.0,1.0,mhmk90
149,B0833-45,J0835-4510,48457.4,1.0,2715.0,2.0,600.0,60.0,,,,,fla91
150,B0833-45,J0835-4510,49559.0,0.2,835.0,2.0,0.0,5.0,,,,,fla94a
151,B0833-45,J0835-4510,49591.82,,199.0,2.0,120.0,20.0,,,,,fla94b
152,B0833-45,J0835-4510,50369.345,0.002,2110.0,17.0,5.95,0.03,0.03,0.004,186.0,12.0,"wmp+00,ymh+13"
153,B0833-45,J0835-4510,51559.319,0.0005,3152.0,2.0,495.0,37.0,0.0088,0.0006,0.53,0.03,dml02
154,,,,,,,,,0.00547,6e-05,3.29,0.03,dml02


#### Ensuring each recovery component has pulsar data 

One can see that there are 4 components for the glitch with different recovery parameters (Q and T_d). To clean this, we will just copy the other data (pulsar_name, J_name, ..., dF1_F1_err) from the first component, as each component has the same properties. 

In [56]:
no_name_indices = df_glitch_ATNF[df_glitch_ATNF['pulsar_name'].isnull()].index

In [57]:
len(no_name_indices) # This gives the number of extra entries due to extra recovery components

18

In [58]:
for index in no_name_indices:
    test_index = index - 1
    df_glitch_ATNF.loc[index, 'pulsar_name': 'dF1_F1_err'] = df_glitch_ATNF.loc[test_index, 'pulsar_name': 'dF1_F1_err']
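The loop above relies on each extra component sitting directly below its glitch. As a sketch of a vectorised alternative (not the method used in this notebook), one could label each glitch with a running identifier and copy the first component's values within each group. The table `toy`, its values and the names `is_extra` and `glitch_id` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the glitch table: rows 1 and 3 are extra recovery components
toy = pd.DataFrame({
    'pulsar_name': ['PSR_A', None, 'PSR_B', None],
    'MJD': [50000.0, np.nan, 51000.0, np.nan],
    'MJD_err': [2.0, np.nan, np.nan, np.nan],
    'Q': [0.004, 0.15, 0.009, 0.005],
})

is_extra = toy['pulsar_name'].isna()                     # Rows that only carry recovery parameters
glitch_id = (~is_extra).cumsum().rename('glitch_id')     # Same id for a glitch and the components below it
cols = ['pulsar_name', 'MJD', 'MJD_err']                 # Columns to copy from the first component

# 'first' returns the first non-null value within each glitch, i.e. the value on its first row
first_values = toy.groupby(glitch_id)[cols].transform('first')
toy.loc[is_extra, cols] = first_values.loc[is_extra]

print(toy)
```

Grouping by glitch, rather than forward-filling the whole column, keeps a genuinely missing value (such as the `MJD_err` of `PSR_B` here) from being overwritten by the value of an earlier glitch.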

In [59]:
df_glitch_ATNF.iloc[145:158]

Unnamed: 0,pulsar_name,J_name,MJD,MJD_err,dF_F,dF_F_err,dF1_F1,dF1_F1_err,Q,Q_err,T_d,T_d_err,references
145,B0833-45,J0835-4510,46259.0,2.0,1598.5,1.5,13.7,1.1,0.0037,0.0005,6.5,0.5,mkhr87
146,B0833-45,J0835-4510,46259.0,2.0,1598.5,1.5,13.7,1.1,0.1541,0.0006,332.0,10.0,mkhr87
147,B0833-45,J0835-4510,47519.8036,8e-05,1805.2,0.8,77.0,6.0,0.005385,1e-05,4.62,0.02,mhmk90
148,B0833-45,J0835-4510,47519.8036,8e-05,1805.2,0.8,77.0,6.0,0.1684,0.0004,351.0,1.0,mhmk90
149,B0833-45,J0835-4510,48457.4,1.0,2715.0,2.0,600.0,60.0,,,,,fla91
150,B0833-45,J0835-4510,49559.0,0.2,835.0,2.0,0.0,5.0,,,,,fla94a
151,B0833-45,J0835-4510,49591.82,,199.0,2.0,120.0,20.0,,,,,fla94b
152,B0833-45,J0835-4510,50369.345,0.002,2110.0,17.0,5.95,0.03,0.03,0.004,186.0,12.0,"wmp+00,ymh+13"
153,B0833-45,J0835-4510,51559.319,0.0005,3152.0,2.0,495.0,37.0,0.0088,0.0006,0.53,0.03,dml02
154,B0833-45,J0835-4510,51559.319,0.0005,3152.0,2.0,495.0,37.0,0.00547,6e-05,3.29,0.03,dml02


#### Checking missing Q_err and T_d_err values

In [61]:
df_glitch_ATNF[df_glitch_ATNF['Q'].notnull() & df_glitch_ATNF['Q_err'].isnull()]

Unnamed: 0,pulsar_name,J_name,MJD,MJD_err,dF_F,dF_F_err,dF1_F1,dF1_F1_err,Q,Q_err,T_d,T_d_err,references
163,B0833-45,J0835-4510,58515.5929,0.0005,2501.2,3.2,8.69,0.28,0.005,,11.0,1.2,"ker19,lbs+20"
465,J1822-1604,J1822-1604,56756.0,,230.0,10.0,,,1.0,,40.0,6.0,skc14


Having checked the above rows in the raw data file, I can confirm that these rows do not have Q_err. (There is nothing wrong with how the code read in the values.)

In [63]:
df_glitch_ATNF[df_glitch_ATNF['T_d'].notnull() & df_glitch_ATNF['T_d_err'].isnull()]

Unnamed: 0,pulsar_name,J_name,MJD,MJD_err,dF_F,dF_F_err,dF1_F1,dF1_F1_err,Q,Q_err,T_d,T_d_err,references
394,B1737-30,J1740-3015,52347.66,0.06,152.0,2.0,-4.6,0.4,0.103,0.009,50.0,,zwm+08
399,B1737-30,J1740-3015,53023.52,0.0,1850.9,0.3,2.4,0.4,0.0302,0.0006,100.0,,"elsk11,zwm+08"


Having checked the above rows in the raw data file, I can confirm that these rows do not have T_d_err. (There is nothing wrong with how the code read in the values.)

#### Checking for NaNs and negative dF_F

In [66]:
df_glitch_ATNF[df_glitch_ATNF['dF_F'].isnull()]

Unnamed: 0,pulsar_name,J_name,MJD,MJD_err,dF_F,dF_F_err,dF1_F1,dF1_F1_err,Q,Q_err,T_d,T_d_err,references
254,B1338-62,J1341-6220,58178.0,15.0,,,,,,,,,ljd+21
255,B1338-62,J1341-6220,58214.0,4.0,,,,,,,,,ljd+21


In [67]:
df_glitch_ATNF = df_glitch_ATNF.dropna(subset=['dF_F']).reset_index(drop=True)

In [68]:
df_glitch_ATNF[df_glitch_ATNF['dF_F'] < 0]

Unnamed: 0,pulsar_name,J_name,MJD,MJD_err,dF_F,dF_F_err,dF1_F1,dF1_F1_err,Q,Q_err,T_d,T_d_err,references
291,J1522-5735,J1522-5735,55250.0,,-11.4,0.6,-1.2,1.3,1.4,0.2,27.0,5.0,pga+13
636,J2301+5852,J2301+5852,56035.0,2.0,-310.0,40.0,2700.0,200.0,,,,,akn+13


In [69]:
df_glitch_ATNF = df_glitch_ATNF[df_glitch_ATNF['dF_F'] >= 0].reset_index(drop=True)

#### Summary after cleaning

In [71]:
df_glitch_ATNF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 640 entries, 0 to 639
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   pulsar_name  640 non-null    string 
 1   J_name       640 non-null    string 
 2   MJD          640 non-null    Float64
 3   MJD_err      595 non-null    Float64
 4   dF_F         640 non-null    Float64
 5   dF_F_err     634 non-null    Float64
 6   dF1_F1       568 non-null    Float64
 7   dF1_F1_err   567 non-null    Float64
 8   Q            135 non-null    Float64
 9   Q_err        133 non-null    Float64
 10  T_d          138 non-null    Float64
 11  T_d_err      136 non-null    Float64
 12  references   640 non-null    string 
dtypes: Float64(10), string(3)
memory usage: 71.4 KB


In [72]:
df_glitch_ATNF['J_name'].nunique()

210

After cleaning, we have 640 entries (each glitch recovery component is one entry) from 210 unique pulsars. As we saw earlier, 18 of these entries are due to extra recovery components, so the ATNF Glitch Catalogue actually contains 622 glitches from 210 unique pulsars. 

Note: If we count the unique combinations of 'J_name' and 'MJD', we actually get 621. This is because one glitch, from J1801-2451, has been entered twice (presumably from two different groups/analyses). For our purposes here, we will say that these two entries correspond to just one glitch.

In [74]:
df_glitch_ATNF[['J_name', 'MJD']].drop_duplicates().shape[0]

621

In [75]:
df_glitch_ATNF[df_glitch_ATNF[['J_name', 'MJD']].duplicated()] # Displays duplicated entries

Unnamed: 0,pulsar_name,J_name,MJD,MJD_err,dF_F,dF_F_err,dF1_F1,dF1_F1_err,Q,Q_err,T_d,T_d_err,references
36,B0531+21,J0534+2200,42447.26,0.04,35.7,0.3,1.6,0.1,0.536,0.012,97.0,4.0,lps93
38,B0531+21,J0534+2200,46663.69,0.03,6.0,1.0,0.5,0.1,0.89,0.09,123.0,40.0,lps93
133,B0833-45,J0835-4510,40280.0,4.0,2338.0,9.0,10.1,0.3,0.01782,5e-05,120.0,6.0,cdk88
135,B0833-45,J0835-4510,41192.0,8.0,2047.0,30.0,14.8,0.2,0.01311,9e-05,94.0,5.0,cdk88
138,B0833-45,J0835-4510,42683.0,3.0,1987.0,8.0,11.0,1.0,0.003534,1.6e-05,35.0,2.0,cdk88
140,B0833-45,J0835-4510,43693.0,12.0,3063.0,65.0,18.3,0.2,0.01134,2e-05,75.0,3.0,cdk88
142,B0833-45,J0835-4510,44888.4,0.4,1138.0,9.0,8.43,0.06,0.0019,4e-05,14.0,2.0,cdk88
144,B0833-45,J0835-4510,45192.1,0.5,2051.0,3.0,23.1,0.3,0.0055,8e-05,21.5,2.0,cdk88
146,B0833-45,J0835-4510,46259.0,2.0,1598.5,1.5,13.7,1.1,0.1541,0.0006,332.0,10.0,mkhr87
148,B0833-45,J0835-4510,47519.8036,8e-05,1805.2,0.8,77.0,6.0,0.1684,0.0004,351.0,1.0,mhmk90


In [76]:
df_glitch_ATNF[(df_glitch_ATNF['J_name'] == 'J1801-2451') & (df_glitch_ATNF['MJD'] == 54661)]

Unnamed: 0,pulsar_name,J_name,MJD,MJD_err,dF_F,dF_F_err,dF1_F1,dF1_F1_err,Q,Q_err,T_d,T_d_err,references
422,B1757-24,J1801-2451,54661.0,2.0,3101.0,1.0,9.3,0.1,0.0064,0.0009,25.0,4.0,"elsk11,ymh+13"
437,B1757-24,J1801-2451,54661.0,2.0,3083.7,0.7,6.5,0.5,,,,,ljd+21


Moreover, we have 135 entries that have a Q value, coming from 117 glitches of 60 unique pulsars.

In [78]:
df_glitch_ATNF[df_glitch_ATNF['Q'].notnull()].nunique()

pulsar_name     60
J_name          60
MJD            115
MJD_err         47
dF_F           117
dF_F_err        42
dF1_F1          86
dF1_F1_err      38
Q              112
Q_err           54
T_d            116
T_d_err         63
references      43
dtype: int64
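Reading the glitch count from `nunique()` depends on choosing a column whose values happen to differ for every glitch (above, MJD gives 115 while dF_F gives 117). As a sketch of a count that does not depend on that choice, one could group the entries by pulsar and epoch. The data below are made up for illustration.

```python
import pandas as pd

# Hypothetical entries with a Q value: the first glitch has two recovery components
toy = pd.DataFrame({
    'J_name': ['J0001+0001', 'J0001+0001', 'J0001+0001', 'J0002+0002'],
    'MJD': [50000.0, 50000.0, 51000.0, 50000.0],
    'Q': [0.004, 0.15, 0.03, 0.9],
})

with_q = toy[toy['Q'].notna()]

# One group per glitch, identified by pulsar and epoch; size() counts its recovery components
components_per_glitch = with_q.groupby(['J_name', 'MJD']).size()

n_entries = len(with_q)
n_glitches = len(components_per_glitch)
n_pulsars = with_q['J_name'].nunique()
print(n_entries, n_glitches, n_pulsars)
```

Note that this treats two entries with the same pulsar and epoch as one glitch, so a glitch entered twice by different analyses would be counted once.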

### ATNF Pulsar Catalogue

#### Setting correct data types

In [81]:
df_pulsars = df_pulsars.convert_dtypes() # Without this line, we only get Float64 columns. We get both Float64 and Int64 with it. All 'None' values are set to pd.NA.
df_pulsars = df_pulsars.apply(pd.to_numeric, errors='ignore') # Converts columns to appropriate data type, if not numeric, column stays as string
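As far as we are aware, `errors='ignore'` in `pd.to_numeric` has been deprecated since pandas 2.2 and raises a `FutureWarning`. A sketch of an equivalent that avoids the option is to catch the parsing error explicitly. The helper name `to_numeric_if_possible` and the toy columns are hypothetical.

```python
import pandas as pd

def to_numeric_if_possible(column):
    """Return a numeric version of `column`, or `column` unchanged if any value cannot be parsed."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column

# Hypothetical columns: one fully numeric, one containing non-numeric labels
toy = pd.DataFrame({
    'P0': ['0.115', '0.693', '2.1e-03'],
    'TYPE': ['HE', 'RRAT', 'NRAD'],
})

converted = toy.apply(to_numeric_if_possible)
print(converted.dtypes)
```

As with `errors='ignore'`, a column is either converted as a whole or left untouched, so a single unparseable entry keeps the entire column as strings.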


In [82]:
df_pulsars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4351 entries, 0 to 4350
Columns: 162 entries, A1 to W50
dtypes: Float64(140), Int64(8), string(14)
memory usage: 6.0 MB


We see that we initially have 140 columns that are floats, 8 that are integers and 14 that are strings. We will check the integer and string columns to ensure that none of them should be floats.

In [84]:
int_columns = df_pulsars.select_dtypes(include='int')
int_columns[int_columns.notnull().any(axis = 1)] # Shows only rows with at least one non-null value

Unnamed: 0,NGLT,S35,S40,S50,S60,S64,S79,S80
2,1,,,,,,,
5,,,,43,,,,
37,,,,86,,,,
41,,,,900,,,,
42,,,,600,,,,5000
...,...,...,...,...,...,...,...,...
4321,1,,,,,,,
4329,,,,111,,,,250
4334,1,,,,,,,
4339,1,,,,,,,


The 8 columns consist of the number of glitches (NGLT) and the mean fluxes (in mJy) at several frequencies (in MHz). To be consistent with the other mean fluxes, e.g. S400, S1400, S2000, we will convert the mean flux columns to floats.

In [86]:
int_to_float_col = int_columns.loc[:, int_columns.columns.str.startswith('S')].columns # Getting the names of the integer columns beginning with S
df_pulsars[int_to_float_col] = df_pulsars[int_to_float_col].astype('Float64')

We will now look at the string columns.

In [88]:
str_columns = df_pulsars.select_dtypes(include='string')
str_columns

Unnamed: 0,ASSOC,BINARY,BINCOMP,CLK,DECJ,EPHEM,PSRB,PSRJ,RAJ,SURVEY,TYPE,UNITS,W10,W50
0,"GRS:4FGL_J0002.8+6217[aab+22],XRS:1XSPS_J00025...",,,,+62:16:09.4,,,J0002+6216,00:02:58.17,FermiBlind,HE[wcp+18],,,
1,,,,,+18:34:59,,,J0006+1834,00:06:04.8,"ar4,ar327,fast_gpps",,,112.1,61.3
2,"GRS:4FGL_J0007.0+7303[aab+22],XRS:RX_J0007.0+7...",,,,+73:03:07.4,,,J0007+7303,00:07:01.7,FermiBlind,NRAD[aab+22],,,
3,,,,,+08:10,,,J0011+08,00:11:34,"ar327,fast_gpps",,,53,13
4,,,,TT(BIPM2019),+54:31:40,DE440,,J0012+5431,00:12:23.3,chime,RRAT[dcm+23],,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4346,,,,,-22:51:53,,,J2354-2250,23:54:26,"htru_pks,gbncc",,,21,9
4347,,ELL1,He[mzl+23],,+00:51:09.57,DE438,,J2355+0051,23:55:51.2885,"fast_crafts,ar327",,TDB,1.7,0.5
4348,,,,,04:43,,,J2355+04,23:55:30,pumps,,,,
4349,,,,TT(BIPM2019),+15:23:19,DE440,,J2355+1523,23:55:48.62,chime,RRAT[dcm+23],,,


From the string columns above, we see that W10 and W50 should both be floats. Also, it would be better to express the declination (DECJ) and right ascension (RAJ) as an angle rather than in dd:mm:ss (degrees, minutes, seconds) and hh:mm:ss (hours, minutes, seconds), respectively.

First, we change the W10 and W50 columns to floats. The underlying issue may have been that pd.to_numeric did not convert strings in scientific notation, e.g. '1.3e+02', or that it struggled with the string 'nan'. We have not confirmed which, so we use .astype('Float64') instead.

In [90]:
df_pulsars[['W10', 'W50']] = df_pulsars[['W10', 'W50']].astype('Float64')
df_pulsars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4351 entries, 0 to 4350
Columns: 162 entries, A1 to W50
dtypes: Float64(149), Int64(1), string(12)
memory usage: 6.0 MB
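To pin down why `pd.to_numeric` leaves a column such as W10 as strings, one option is to coerce the column and list the entries that were present but failed to parse. The sketch below uses a made-up Series (`widths` and its values are hypothetical). It does not establish what the cause was in the catalogue itself.

```python
import pandas as pd

# Hypothetical width column: mostly numeric strings, one missing value and one malformed entry
widths = pd.Series(['112.1', '1.3e+02', None, '53', '12..5'], dtype=object)

# errors='coerce' turns anything unparseable into NaN instead of raising
coerced = pd.to_numeric(widths, errors='coerce')

# Entries that were present in the raw column but failed to parse
unparseable = widths[coerced.isna() & widths.notna()]

print(unparseable)
```

In this toy example the scientific-notation string parses without trouble and only the malformed entry is reported.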


#### Converting DECJ and RAJ columns to degrees (DEC and RA)

We will now write two functions to convert DECJ and RAJ into degrees:

In [93]:
def convert_decj_to_deg(decj): # decj is a string in the format 'dd:mm:ss', 'dd:mm' or 'dd' with a plus, minus or nothing in front
    if pd.isna(decj):
        return pd.NA

    # Read the sign from the string itself: float('-00') is -0.0, which is not < 0, so a numeric test would lose the sign
    sign = -1.0 if decj.startswith('-') else 1.0
    float_string = list(map(float, decj.lstrip('+-').split(':')))

    degrees = float_string[0]
    arcminutes = float_string[1] if len(float_string) > 1 else 0
    arcseconds = float_string[2] if len(float_string) > 2 else 0

    angle = degrees + (arcminutes/60.0) + (arcseconds/3600.0)

    return sign * angle
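The sign of a declination needs care for values between -1 and 0 degrees. The leading field of a string such as '-00:30:00' parses to -0.0, which does not compare as less than zero. Below is a minimal standalone sketch of reading the sign from the text instead. The function name `dec_string_to_deg` is hypothetical.

```python
def dec_string_to_deg(decj):
    """Convert 'dd:mm:ss', 'dd:mm' or 'dd' (optional leading '+' or '-') to decimal degrees."""
    sign = -1.0 if decj.startswith('-') else 1.0   # Sign comes from the text, not the parsed number
    fields = [float(part) for part in decj.lstrip('+-').split(':')]
    fields += [0.0] * (3 - len(fields))            # Pad missing arcminutes/arcseconds with zero
    degrees, arcminutes, arcseconds = fields
    return sign * (degrees + arcminutes / 60.0 + arcseconds / 3600.0)

# float('-00') is -0.0, and -0.0 < 0 is False, so a numeric sign test would miss this case
print(float('-00') < 0)
print(dec_string_to_deg('-00:30:00'))
print(dec_string_to_deg('+08:10'))
```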

In [94]:
df_pulsars['DEC'] = df_pulsars['DECJ'].apply(convert_decj_to_deg).astype('Float64')

In [95]:
df_pulsars[['DECJ', 'DEC']]

Unnamed: 0,DECJ,DEC
0,+62:16:09.4,62.269278
1,+18:34:59,18.583056
2,+73:03:07.4,73.052056
3,+08:10,8.166667
4,+54:31:40,54.527778
...,...,...
4346,-22:51:53,-22.864722
4347,+00:51:09.57,0.852658
4348,04:43,4.716667
4349,+15:23:19,15.388611


In [96]:
def convert_raj_to_deg(raj): # raj is a string in the format 'hh:mm:ss', 'hh:mm' or 'hh'
    if pd.isna(raj):
        return pd.NA

    float_string = list(map(float, raj.split(':')))

    hours = float_string[0]
    minutes = float_string[1] if len(float_string) > 1 else 0 # Guard against entries given only as 'hh'
    seconds = float_string[2] if len(float_string) > 2 else 0

    # 24 hours of right ascension correspond to 360 degrees
    angle = hours * (360.0/24.0) + minutes * (360.0/(24.0 * 60.0)) + seconds * (360.0/(24.0 * 60.0 * 60.0))

    return angle

In [97]:
df_pulsars['RA'] = df_pulsars['RAJ'].apply(convert_raj_to_deg).astype('Float64')

In [98]:
df_pulsars[['RAJ', 'RA']]

Unnamed: 0,RAJ,RA
0,00:02:58.17,0.742375
1,00:06:04.8,1.52
2,00:07:01.7,1.757083
3,00:11:34,2.891667
4,00:12:23.3,3.097083
...,...,...
4346,23:54:26,358.608333
4347,23:55:51.2885,358.963702
4348,23:55:30,358.875
4349,23:55:48.62,358.952583


## Combining datasets

### Comparing JBCA and ATNF Glitch Catalogues

#### Pulsars in both databases

In [102]:
s1 = set(df_glitch_JBCA['J_name'])
s2 = set(df_glitch_ATNF['J_name'])
s3 = set(df_pulsars['PSRJ'])

In [103]:
pulsars_in_both = s1 & s2
len(pulsars_in_both)

189

#### Pulsars in JBCA but missing from ATNF

In [105]:
in_JBCA_not_ATNF = s1 - s2
len(in_JBCA_not_ATNF)

34

In [106]:
in_JBCA_not_ATNF

{'J0040-7335',
 'J0040-7337',
 'J0048-7317',
 'J0726-2612',
 'J0738-4042',
 'J0855-3331',
 'J0955+6940',
 'J1048-5937',
 'J1341-6023',
 'J1647-4552',
 'J1730-3353',
 'J1809-0119',
 'J1821-1419',
 'J1828-1101',
 'J1832+0029',
 'J1835-0024',
 'J1838-0537',
 'J1838-0655',
 'J1843-0509',
 'J1844-0310',
 'J1849-0001',
 'J1849-0636',
 'J1854-1557',
 'J1904-1629',
 'J1907+0602',
 'J1914+1122',
 'J1935+1616',
 'J1935+2025',
 'J1948+2819',
 'J1949-2524',
 'J1955+2529',
 'J2004+3427',
 'J2022+2854',
 'J2111+4606'}

Appendix C of Yim, Shao & Xu (2024) found 19 pulsars, 17 of which are identified above. The 2 that are no longer found have been corrected by JBCA (J1635-2614 --> J1636-2614, M82-X2 --> J0955+6940). This means that 17 new pulsars have been added to the JBCA Glitch Catalogue since July 2024 (when the paper was published) that have not yet been added to the ATNF Glitch Catalogue.

#### Pulsars in ATNF but missing from JBCA

In [108]:
in_ATNF_not_JBCA = s2 - s1
len(in_ATNF_not_JBCA)

21

In [109]:
in_ATNF_not_JBCA

{'J0908-4913',
 'J0954-5430',
 'J1015-5719',
 'J1050-5953',
 'J1141-6545',
 'J1422-6138',
 'J1550-5418',
 'J1602-5100',
 'J1645-0317',
 'J1703-4851',
 'J1706-4434',
 'J1722-3632',
 'J1822-1604',
 'J1844-0346',
 'J1852-0635',
 'J1906+0722',
 'J1910+1026',
 'J1915+1150',
 'J1939+2609',
 'J1947+1957',
 'J1954+2529'}

The above pulsars are exactly the same as what was identified in Appendix C of Yim, Shao & Xu (2024) except for one that is missing. The missing pulsar is J1636-2614 which has since been corrected by JBCA (J1635-2614 --> J1636-2614), making it consistent with the ATNF databases.
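The three comparisons above (in both catalogues, only in JBCA, only in ATNF) could also be obtained in one pass with an outer merge and `indicator=True`. The sketch below uses made-up J-names. In practice each side would first be reduced to its unique J-names, e.g. with `drop_duplicates()`.

```python
import pandas as pd

# Hypothetical J-names standing in for the unique pulsars of each glitch catalogue
jbca = pd.DataFrame({'J_name': ['J0001+0001', 'J0002+0002', 'J0003+0003']})
atnf = pd.DataFrame({'J_name': ['J0002+0002', 'J0003+0003', 'J0004+0004']})

# indicator=True adds a '_merge' column: 'both', 'left_only' (JBCA only) or 'right_only' (ATNF only)
comparison = jbca.merge(atnf, on='J_name', how='outer', indicator=True)

counts = comparison['_merge'].value_counts()
print(comparison)
print(counts)
```

Keeping the result as a table makes it easy to filter on `_merge` afterwards, whereas the set operations used above are more compact when only the names are needed.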

#### Pulsars in JBCA Glitch Catalogue but not in ATNF Pulsar Catalogue

In [111]:
in_JBCA_not_pulsars = s1 - s3
len(in_JBCA_not_pulsars)

6

In [112]:
in_JBCA_not_pulsars

{'J0955+6940',
 'J1048-5937',
 'J1835-0024',
 'J1843-0509',
 'J1955+2529',
 'J2004+3427'}

These pulsars have not had their information added to the ATNF Pulsar Catalogue yet.

#### Pulsars in ATNF Glitch Catalogue but not in ATNF Pulsar Catalogue

In [114]:
in_ATNF_not_pulsars = s2 - s3
len(in_ATNF_not_pulsars)

0

There are no pulsars in the ATNF Glitch Catalogue that are not already in the ATNF Pulsar Catalogue, as expected.