# Removing duplicates from SwiftDS-Stripe82

We have produced a catalog with the Swift DeepSky pipeline, using all Swift observations inside the Stripe82 region.

Let's now filter out the duplicated entries and take a look at the result.

In [1]:
swift_file = 'table_countrates_stripe82_all_swift_pointings.csv'

In [7]:
import pandas as pd

df = pd.read_csv(swift_file, sep=';')

In [9]:
df.head()

Unnamed: 0,RA,DEC,countrates_0.3-10keV(ph.s-1),countrates_error_0.3-10keV(ph.s-1),exposure_time(s),countrates_0.3-1keV(ph.s-1),countrates_error_0.3-1keV(ph.s-1),upper_limit_0.3-1keV(ph.s-1),countrates_1-2keV(ph.s-1),countrates_error_1-2keV(ph.s-1),upper_limit_1-2keV(ph.s-1),countrates_2-10keV(ph.s-1),countrates_error_2-10keV(ph.s-1),upper_limit_2-10keV(ph.s-1)
0,00:56:24.480,-01:16:38.317,0.005645,0.0015,4572.8,0.000868,0.0006,-999.0,0.00304,0.001125,-999.0,0.001737,0.000853,-999.0
1,00:56:19.136,-01:14:58.198,0.004449,0.0014,4684.2,0.000494,0.000485,-999.0,0.001977,0.000959,-999.0,0.001977,0.000959,-999.0
2,00:56:23.004,-01:13:39.516,0.003901,0.0014,4649.6,0.000867,0.000679,-999.0,0.001301,0.00084,-999.0,0.001734,0.000969,-999.0
3,00:56:16.418,-01:14:05.418,0.005795,0.0016,4642.0,0.000446,0.00046,-999.0,0.002674,0.0011,-999.0,0.002674,0.0011,-999.0
4,00:56:16.922,-01:13:16.401,0.005224,0.0015,4587.3,0.002177,0.000937,-999.0,0.001306,0.000741,-999.0,0.001741,0.000853,-999.0


In [10]:
from astropy.coordinates import SkyCoord
from astropy import units

swift_coords = SkyCoord(df.RA, df.DEC, unit=(units.hourangle,units.degree))
df['RA'] = swift_coords.ra.deg
df['DEC'] = swift_coords.dec.deg
df.head()



Unnamed: 0,RA,DEC,countrates_0.3-10keV(ph.s-1),countrates_error_0.3-10keV(ph.s-1),exposure_time(s),countrates_0.3-1keV(ph.s-1),countrates_error_0.3-1keV(ph.s-1),upper_limit_0.3-1keV(ph.s-1),countrates_1-2keV(ph.s-1),countrates_error_1-2keV(ph.s-1),upper_limit_1-2keV(ph.s-1),countrates_2-10keV(ph.s-1),countrates_error_2-10keV(ph.s-1),upper_limit_2-10keV(ph.s-1)
0,14.102,-1.27731,0.005645,0.0015,4572.8,0.000868,0.0006,-999.0,0.00304,0.001125,-999.0,0.001737,0.000853,-999.0
1,14.079733,-1.249499,0.004449,0.0014,4684.2,0.000494,0.000485,-999.0,0.001977,0.000959,-999.0,0.001977,0.000959,-999.0
2,14.09585,-1.227643,0.003901,0.0014,4649.6,0.000867,0.000679,-999.0,0.001301,0.00084,-999.0,0.001734,0.000969,-999.0
3,14.068408,-1.234838,0.005795,0.0016,4642.0,0.000446,0.00046,-999.0,0.002674,0.0011,-999.0,0.002674,0.0011,-999.0
4,14.070508,-1.221223,0.005224,0.0015,4587.3,0.002177,0.000937,-999.0,0.001306,0.000741,-999.0,0.001741,0.000853,-999.0
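
The conversion above turns sexagesimal `RA` (hours) and `DEC` (degrees) into decimal degrees. As a hand check of what `SkyCoord` is doing for the first row, here is a minimal pure-Python sketch. The helper names `hms_to_deg` and `dms_to_deg` are hypothetical and only illustrate the arithmetic; they are not part of astropy.

```python
def hms_to_deg(hms):
    """Convert 'HH:MM:SS.sss' (hour angle) to decimal degrees."""
    h, m, s = (float(x) for x in hms.split(':'))
    # 24 hours span 360 degrees, so 1 hour = 15 degrees
    return 15.0 * (h + m / 60.0 + s / 3600.0)

def dms_to_deg(dms):
    """Convert '[+-]DD:MM:SS.sss' to decimal degrees."""
    text = dms.strip()
    # Take the sign from the string itself so that '-00:30:00' stays negative
    sign = -1.0 if text.startswith('-') else 1.0
    d, m, s = (float(x) for x in text.lstrip('+-').split(':'))
    return sign * (d + m / 60.0 + s / 3600.0)

# First row of the table shown before the conversion
ra_deg = hms_to_deg('00:56:24.480')
dec_deg = dms_to_deg('-01:16:38.317')
print(round(ra_deg, 6), round(dec_deg, 6))
```

The result should agree with the `RA` and `DEC` values of row 0 in the converted table, up to rounding.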


## xmatch using 'gc-filtering' algorithm

Let's remove the duplicates by cross-matching the catalog with itself: entries that fall within the search radius of one another are treated as duplicates and removed according to their SNR.
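
To make the idea concrete, here is a minimal brute-force sketch of SNR-based duplicate filtering. It is not the implementation used by `xmatch`: it assumes that, within each group of nearby sources, the entry with the highest SNR is the one to keep, and it uses a simple greedy pass. The function names and the toy catalog values are hypothetical.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))

def drop_duplicates_by_snr(sources, radius_arcsec=6.0):
    """Greedy filter: visit sources from highest to lowest SNR and keep one
    only if no already-kept source lies within the search radius."""
    radius_deg = radius_arcsec / 3600.0
    kept = []
    for src in sorted(sources, key=lambda s: s['snr'], reverse=True):
        is_duplicate = any(
            angular_separation_deg(src['RA'], src['DEC'], k['RA'], k['DEC']) <= radius_deg
            for k in kept
        )
        if not is_duplicate:
            kept.append(src)
    # Return the survivors in their original ID order
    return sorted(kept, key=lambda s: s['ID'])

# Toy catalog (made-up values): IDs 0 and 1 are about 3.6 arcsec apart
toy = [
    {'ID': 0, 'RA': 14.1020, 'DEC': -1.2773, 'snr': 3.8},
    {'ID': 1, 'RA': 14.1030, 'DEC': -1.2773, 'snr': 5.1},
    {'ID': 2, 'RA': 14.0797, 'DEC': -1.2495, 'snr': 3.2},
]
unique = drop_duplicates_by_snr(toy)
kept_ids = [s['ID'] for s in unique]
print(kept_ids)
```

The pairwise loop is quadratic in the number of sources, which is fine for illustration but slow for a catalog of several thousand rows; that is one reason to rely on a dedicated cross-matching tool instead.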

### SNR column

Let's define each object's overall SNR estimate as the ratio between the columns `countrates_0.3-10keV(ph.s-1)` and `countrates_error_0.3-10keV(ph.s-1)`, which hold the count rate and its associated error for the full-band (0.3-10 keV) emission.

In [12]:
df['snr'] = df['countrates_0.3-10keV(ph.s-1)']/df['countrates_error_0.3-10keV(ph.s-1)']
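
As a quick sanity check of this definition, the sketch below recomputes the SNR of the first catalog row by hand. The helper `snr_or_nan` is hypothetical; it adds a guard for non-positive errors, which is an extra precaution and not something the cell above does.

```python
import math

def snr_or_nan(countrate, error):
    """Return countrate / error, or NaN when the error is not a positive number."""
    if error is None or not (error > 0):
        return math.nan
    return countrate / error

# Full-band count rate and error of row 0, taken from the table above
snr_row0 = snr_or_nan(0.005645, 0.0015)
print(round(snr_row0, 6))

# A non-positive error yields NaN instead of a misleading ratio
snr_bad = snr_or_nan(0.00117, -1.199)
print(snr_bad)
```

The first value should match the `snr` of row 0 in the table displayed after the column is added.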

**xmatch** needs the columns 'ra', 'dec' and 'id' to be defined, as well as the search radius.

In [13]:
df.reset_index(inplace=True)
df.rename(columns={'index':'ID'}, inplace=True)

In [14]:
df.head()

Unnamed: 0,ID,RA,DEC,countrates_0.3-10keV(ph.s-1),countrates_error_0.3-10keV(ph.s-1),exposure_time(s),countrates_0.3-1keV(ph.s-1),countrates_error_0.3-1keV(ph.s-1),upper_limit_0.3-1keV(ph.s-1),countrates_1-2keV(ph.s-1),countrates_error_1-2keV(ph.s-1),upper_limit_1-2keV(ph.s-1),countrates_2-10keV(ph.s-1),countrates_error_2-10keV(ph.s-1),upper_limit_2-10keV(ph.s-1),snr
0,0,14.102,-1.27731,0.005645,0.0015,4572.8,0.000868,0.0006,-999.0,0.00304,0.001125,-999.0,0.001737,0.000853,-999.0,3.763333
1,1,14.079733,-1.249499,0.004449,0.0014,4684.2,0.000494,0.000485,-999.0,0.001977,0.000959,-999.0,0.001977,0.000959,-999.0,3.177857
2,2,14.09585,-1.227643,0.003901,0.0014,4649.6,0.000867,0.000679,-999.0,0.001301,0.00084,-999.0,0.001734,0.000969,-999.0,2.786429
3,3,14.068408,-1.234838,0.005795,0.0016,4642.0,0.000446,0.00046,-999.0,0.002674,0.0011,-999.0,0.002674,0.0011,-999.0,3.621875
4,4,14.070508,-1.221223,0.005224,0.0015,4587.3,0.002177,0.000937,-999.0,0.001306,0.000741,-999.0,0.001741,0.000853,-999.0,3.482667


In [15]:
from astropy import units
radius = 6 * units.arcsec

cols = {'ra':'RA', 'dec':'DEC', 'id':'ID'}

In [16]:
from xmatch import xmatch

help(xmatch)

Help on function xmatch in module xmatch.xmatchi:

xmatch(catalog_A, catalog_B, columns_A=None, columns_B=None, radius=None, separation_unit='arcsec', method='gc', parallel=False, nprocs=None, snr_column=None)
    Input:
     - catalog_A, catalog_B : ~pandas.DataFrame
             DFs containing (at least) the columns 'ra','dec','id'
     - columns_A, columns_B : dict mapping 'ra','dec','id' columns
            In case catalog(s) have different column names for 'ra','dec','id';
            e.g, {'ra':'RA', 'dec':'Dec', 'id':'ObjID'}
    
    Output:
     - matched_catalog : ~pandas.DataFrame



In [17]:
xcat = xmatch(df, df, columns_A=cols, columns_B=cols, radius=radius, snr_column='snr')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [19]:
print(len(xcat))
xcat.head(20)

2743


Unnamed: 0_level_0,A,A,A,B,B,B,AB,AB,AB
Unnamed: 0_level_1,RA,DEC,ID,RA,DEC,ID,snr,duplicates,snrs
0,14.102,-1.27731,0,14.102,-1.27731,0,3.763333,,
1,14.079733,-1.249499,1,14.079733,-1.249499,1,3.177857,,
2,14.09585,-1.227643,2,14.09585,-1.227643,2,2.786429,,
3,14.068408,-1.234838,3,14.068408,-1.234838,3,3.621875,,
4,14.070508,-1.221223,4,14.070508,-1.221223,4,3.482667,,
5,14.055662,-1.248697,5,14.055662,-1.248697,5,3.100714,,
6,14.107787,-1.356168,6,14.107787,-1.356168,6,2.6,,
7,14.012692,-1.239376,7,14.012692,-1.239376,7,2.72,,
8,356.169054,-0.202381,8,356.169054,-0.202381,8,5.2075,,
9,356.186058,-0.112661,9,356.186058,-0.112661,9,3.781319,,


In [35]:
pcat = df.set_index('ID').loc[xcat[('B','ID')]]

In [36]:
print(len(pcat))
pcat.head(20)

2743


Unnamed: 0_level_0,RA,DEC,countrates_0.3-10keV(ph.s-1),countrates_error_0.3-10keV(ph.s-1),exposure_time(s),countrates_0.3-1keV(ph.s-1),countrates_error_0.3-1keV(ph.s-1),upper_limit_0.3-1keV(ph.s-1),countrates_1-2keV(ph.s-1),countrates_error_1-2keV(ph.s-1),upper_limit_1-2keV(ph.s-1),countrates_2-10keV(ph.s-1),countrates_error_2-10keV(ph.s-1),upper_limit_2-10keV(ph.s-1),snr
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
0,14.102,-1.27731,0.005645,0.0015,4572.8,0.000868,0.0006,-999.0,0.00304,0.001125,-999.0,0.001737,0.000853,-999.0,3.763333
1,14.079733,-1.249499,0.004449,0.0014,4684.2,0.000494,0.000485,-999.0,0.001977,0.000959,-999.0,0.001977,0.000959,-999.0,3.177857
2,14.09585,-1.227643,0.003901,0.0014,4649.6,0.000867,0.000679,-999.0,0.001301,0.00084,-999.0,0.001734,0.000969,-999.0,2.786429
3,14.068408,-1.234838,0.005795,0.0016,4642.0,0.000446,0.00046,-999.0,0.002674,0.0011,-999.0,0.002674,0.0011,-999.0,3.621875
4,14.070508,-1.221223,0.005224,0.0015,4587.3,0.002177,0.000937,-999.0,0.001306,0.000741,-999.0,0.001741,0.000853,-999.0,3.482667
5,14.055662,-1.248697,0.004341,0.0014,4462.9,0.00217,0.001002,-999.0,0.001627,0.000872,-999.0,0.000543,0.000506,-999.0,3.100714
6,14.107787,-1.356168,0.00312,0.0012,4736.2,0.00156,0.00088,-999.0,0.00117,-1.199,1.101,0.00156,0.00088,-999.0,2.6
7,14.012692,-1.239376,0.003536,0.0013,4647.6,0.00101,0.000707,-999.0,0.002526,0.001138,-999.0,0.001768,-1.299,0.9447,2.72
8,356.169054,-0.202381,0.006249,0.0012,4881.5,0.001785,0.000657,-999.0,0.002678,0.000776,-999.0,0.001785,0.000657,-999.0,5.2075
9,356.186058,-0.112661,0.003441,0.00091,5492.7,0.001147,0.000538,-999.0,0.000382,0.000306,-999.0,0.001911,0.000695,-999.0,3.781319


In [37]:
pcat.reset_index(inplace=True)
del pcat['snr']

In [39]:
pcat.sample(10)

Unnamed: 0,ID,RA,DEC,countrates_0.3-10keV(ph.s-1),countrates_error_0.3-10keV(ph.s-1),exposure_time(s),countrates_0.3-1keV(ph.s-1),countrates_error_0.3-1keV(ph.s-1),upper_limit_0.3-1keV(ph.s-1),countrates_1-2keV(ph.s-1),countrates_error_1-2keV(ph.s-1),upper_limit_1-2keV(ph.s-1),countrates_2-10keV(ph.s-1),countrates_error_2-10keV(ph.s-1),upper_limit_2-10keV(ph.s-1)
628,796,41.0029,0.790174,0.001332,0.00043,12665.4,0.000381,0.000231,-999.0,0.000571,0.00028,-999.0,0.000381,0.000231,-999.0
2071,3634,58.441383,-0.186862,0.000304,8.7e-05,83203.3,6.5e-05,4e-05,-999.0,0.000131,5.7e-05,-999.0,0.000109,5.2e-05,-999.0
546,2622,57.110108,-0.948431,0.000321,8.9e-05,89883.4,0.000161,6.1e-05,-999.0,9.6e-05,4.7e-05,-999.0,6.4e-05,3.9e-05,-999.0
1988,3302,322.567979,-0.701916,0.008196,0.0034,1201.3,0.008196,0.0034,-999.0,0.0,-3.397,0.9446,0.0,-3.397,0.9446
1725,2672,356.861917,0.387546,0.000279,8.6e-05,85452.5,7e-05,4.3e-05,-999.0,0.000139,6e-05,-999.0,7e-05,4.3e-05,-999.0
521,558,27.132512,1.382409,0.001869,0.00073,5178.0,0.001869,0.00073,-999.0,0.000479,-0.7293,0.8266,0.000335,-0.7293,0.8266
877,1075,46.251325,-1.098815,0.007441,0.0007,9333.7,0.00186,0.000364,-999.0,0.004651,0.000574,-999.0,0.00093,0.000259,-999.0
12,12,356.259583,-0.200416,0.003065,0.0009,5003.4,0.000383,0.000324,-999.0,0.001916,0.00072,-999.0,0.000766,0.000462,-999.0
1350,1762,40.586292,-0.349301,0.003121,0.0005,17760.4,0.000503,0.0002,-999.0,0.000906,0.000269,-999.0,0.001712,0.000369,-999.0
1893,3063,20.611,0.934171,0.001548,0.00035,19181.2,0.000489,0.000197,-999.0,0.000652,0.000226,-999.0,0.000407,0.000175,-999.0


In [28]:
pcat.describe()

Unnamed: 0,ID,RA,DEC,countrates_0.3-10keV(ph.s-1),countrates_error_0.3-10keV(ph.s-1),exposure_time(s),countrates_0.3-1keV(ph.s-1),countrates_error_0.3-1keV(ph.s-1),upper_limit_0.3-1keV(ph.s-1),countrates_1-2keV(ph.s-1),countrates_error_1-2keV(ph.s-1),upper_limit_1-2keV(ph.s-1),countrates_2-10keV(ph.s-1),countrates_error_2-10keV(ph.s-1),upper_limit_2-10keV(ph.s-1)
count,2743.0,2743.0,2743.0,2743.0,2743.0,2743.0,2743.0,2743.0,2743.0,2743.0,2743.0,2743.0,2743.0,2743.0,2743.0
mean,2596.637988,126.358611,-0.105252,0.010052,0.000792,35114.480496,0.004289,-0.057154,-1015.284631,0.003434,-0.117692,-1007.621891,0.002732,-0.165562,-979.186074
std,1796.110941,139.928501,0.788274,0.09076,0.001327,49079.959543,0.037934,0.310206,4279.634511,0.030661,0.782409,4280.553107,0.025583,0.864109,4283.837361
min,0.0,0.049904,-1.639452,0.000119,3.6e-05,387.5,0.0,-5.195,-224700.0,0.0,-13.99,-224700.0,0.0,-13.99,-224700.0
25%,1136.5,29.5383,-0.7924,0.000853,0.00019,7164.65,0.00022,8.2e-05,-999.0,0.000246,8.2e-05,-999.0,0.000222,7.1e-05,-999.0
50%,2298.0,44.141213,-0.159177,0.001715,0.00043,14511.1,0.00055,0.000209,-999.0,0.00056,0.00021,-999.0,0.000499,0.00018,-999.0
75%,3972.5,320.833771,0.52032,0.003691,0.00083,41103.05,0.001397,0.000468,-999.0,0.001332,0.000443,-999.0,0.001103,0.000381,-999.0
max,6859.0,359.917396,1.606157,2.716,0.019,295994.0,1.091,0.014,6.601,0.9114,0.01229,6.62,0.9446,0.0076,6.62


## Plot: countrates vs. exposure-time

In [26]:
from bokeh.plotting import figure
from bokeh.io import output_notebook, show
output_notebook()

In [78]:
import numpy as np

p = figure(y_axis_type="log", x_axis_type="log")

p.circle(df['exposure_time(s)'], df['countrates_0.3-10keV(ph.s-1)'])

show(p)

## Plot: exposure-time

In [31]:
%matplotlib notebook

df['exposure_time(s)'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x7fbbba740a90>

## Plot: countrates

In [74]:
import seaborn as sns

ax = sns.distplot(df['countrates_0.3-10keV(ph.s-1)'], kde=False)
ax.set_yscale('log')

In [75]:
pileup = sum(df['countrates_0.3-10keV(ph.s-1)'] > 0.5)
print('Number of pile-ups: ', pileup)

Number of pile-ups:  24


Pile-up occurs when two or more photons arrive in the same CCD frame and the instrument reads them out as a single, very energetic photon.

A count rate above $0.5$ ph/s should raise an eyebrow, as pile-up is probably taking place.
* http://www.swift.ac.uk/analysis/xrt/pileup.php
* https://arxiv.org/abs/astro-ph/0701815

Potentially piled-up sources (24 in total) are removed from our sample.

In [79]:
idx_pu = df['countrates_0.3-10keV(ph.s-1)'] > 0.5
df = df.loc[~idx_pu]

df.describe()

Unnamed: 0,ID,RA,DEC,countrates_0.3-10keV(ph.s-1),countrates_error_0.3-10keV(ph.s-1),exposure_time(s),countrates_0.3-1keV(ph.s-1),countrates_error_0.3-1keV(ph.s-1),upper_limit_0.3-1keV(ph.s-1),countrates_1-2keV(ph.s-1),countrates_error_1-2keV(ph.s-1),upper_limit_1-2keV(ph.s-1),countrates_2-10keV(ph.s-1),countrates_error_2-10keV(ph.s-1),upper_limit_2-10keV(ph.s-1),snr
count,6863.0,6863.0,6863.0,6863.0,6863.0,6863.0,6863.0,6863.0,6863.0,6863.0,6863.0,6863.0,6863.0,6863.0,6863.0,6863.0
mean,3442.496284,116.464649,-0.200815,0.005234,0.000597,46730.938555,0.002119,-0.04457,-1006.485318,0.001881,-0.066753,-1012.745902,0.001474,-0.101662,-988.847759,6.080724
std,1987.694049,133.397136,0.788142,0.023175,0.000957,55957.695837,0.010527,0.259465,3826.611306,0.008644,0.530089,3825.776333,0.006036,0.606486,3828.910285,8.057549
min,0.0,0.049904,-1.639452,0.000119,3.6e-05,387.5,0.0,-5.195,-224700.0,0.0,-13.99,-224700.0,0.0,-13.99,-224700.0,2.040541
25%,1721.5,31.286925,-0.862992,0.000663,0.00014,9321.0,0.000168,5.9e-05,-999.0,0.000199,6.6e-05,-999.0,0.000188,6.2e-05,-999.0,3.287234
50%,3442.0,45.260975,-0.277841,0.001435,0.00032,21131.0,0.000428,0.000147,-999.0,0.000464,0.000158,-999.0,0.000409,0.000137,-999.0,4.28
75%,5163.5,310.090387,0.40573,0.003022,0.00067,68066.65,0.001084,0.000357,-999.0,0.001101,0.000365,-999.0,0.000926,0.000312,-999.0,6.192391
max,6886.0,359.917396,1.606157,0.4392,0.014,295994.0,0.2078,0.014,6.601,0.2263,0.008,6.62,0.1435,0.006848,6.62,189.384615
